Yong-Surk Lee
Yonsei University
Publication
Featured research published by Yong-Surk Lee.
IEEE Transactions on Consumer Electronics | 2001
Sangook Moon; Jaemin Park; Yong-Surk Lee
We propose new methods for calculating fast VLSI arithmetic algorithms for secure data encryption and decryption in the elliptic curve cryptosystem (ECC), and also verify the proof-of-concepts by numerical expressions and through the use of HDL (hardware description language). We have developed a fast finite field multiplier that utilizes a new concept, and a finite field divider with an improved internal structure, as well as a novel fast algorithm for calculating kP, which is the most time-consuming operation in the ECC data encryption scheme. The proposed multiplier features a higher throughput per cost ratio than any other existing Galois field (GF) multiplier that can be used in the large prime finite field. Furthermore, our improved divider shows better extensibility. The developed algorithm for point multiplication decreases the steps required for iteration by half compared to that of the traditional double-and-add algorithm. It also reduces the number of field multiplications by about 19% and that of field divisions by about 9%.
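For reference, the traditional double-and-add baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the paper's improved algorithm: the toy curve parameters `P`, `A`, `B` and the helper names `add` and `double_and_add` are assumptions chosen for readability, and a real ECC implementation would use a standardized curve and constant-time arithmetic.

```python
# Toy curve y^2 = x^3 + A*x + B over the prime field F_P.
# P, A, B are illustrative assumptions, not parameters from the paper.
P, A, B = 97, 2, 3
INF = None  # the point at infinity (group identity)


def add(p1, p2):
    """Add two affine points on the toy curve."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    # P + (-P) = infinity (also covers doubling a point with y == 0).
    if x1 == x2 and (y1 + y2) % P == 0:
        return INF
    if p1 == p2:
        # Tangent slope for point doubling: one field division.
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P, -1, P) % P
    else:
        # Chord slope for point addition: one field division.
        lam = (y2 - y1) * pow((x2 - x1) % P, -1, P) % P
    x3 = (lam * lam - x1 - x2) % P
    y3 = (lam * (x1 - x3) - y1) % P
    return (x3, y3)


def double_and_add(k, point):
    """Compute k*point by scanning the bits of k from least significant."""
    result = INF
    addend = point
    while k:
        if k & 1:
            result = add(result, addend)
        addend = add(addend, addend)  # one doubling per bit of k
        k >>= 1
    return result
```

The loop runs once per bit of k, which is the iteration count the abstract reports halving. Modular inverses via `pow(x, -1, P)` require Python 3.8 or later.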
Energy Procedia | 2004
Jongsu Park; San Kim; Yong-Surk Lee
The Booth algorithm produces Booth-encoded partial products with a value of zero when the input data stream has sequentially equal values. Therefore, partial products have a greater chance of being zero when the input with the smaller dynamic range is used as the multiplier. To minimize the switching activity of partial products, we propose a novel multiplication algorithm and its associated architecture. The proposed algorithm divides a multiplication expression into four multiplication expressions, and each multiplication is computed independently. Finally, the results of the four multiplications are added. Therefore, the rate at which the two inputs are exchanged can be higher during multiplication. Implementation results show that the proposed multiplier can save up to about 20% in power dissipation compared with the previous Booth multiplier.
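The observation about zero partial products can be illustrated with a small model. The sketch below assumes radix-2 Booth recoding for simplicity (the abstract does not state the radix used), and the function names `booth_digits` and `booth_multiply` are hypothetical. It shows only why the operand with the smaller dynamic range makes a better multiplier; it does not model the paper's four-way split.

```python
def booth_digits(value, width):
    """Radix-2 Booth recoding of a two's-complement value into digits in {-1, 0, +1}.

    Digit i is b[i-1] - b[i] (with b[-1] = 0), so it is zero whenever two
    adjacent bits of the operand are equal.
    """
    bits = value & ((1 << width) - 1)
    digits = []
    prev = 0
    for i in range(width):
        cur = (bits >> i) & 1
        digits.append(prev - cur)
        prev = cur
    return digits


def booth_multiply(multiplicand, multiplier, width):
    """Return (product, number of non-zero partial products)."""
    digits = booth_digits(multiplier, width)
    # Each non-zero digit contributes +/- the shifted multiplicand.
    product = sum(d * (multiplicand << i) for i, d in enumerate(digits))
    nonzero = sum(1 for d in digits if d != 0)
    return product, nonzero


# Same product, but far fewer non-zero partial products when the
# small-range operand (3) is used as the multiplier.
small_as_multiplier = booth_multiply(21845, 3, 16)
large_as_multiplier = booth_multiply(3, 21845, 16)
```

Here 21845 is the alternating bit pattern 0x5555, a worst case in which every adjacent bit pair differs, so every recoded digit is non-zero.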
international semiconductor conference | 1997
Yonghwan Lee; Wookyung Jeong; Sangjun Ahn; Yong-Surk Lee
In this paper, we propose a shared tag memory through which both the TLB and the cache memory can be accessed. The shared tag architecture reduces the area of the conventional cache tag memory and also improves the speed of the cache system. To validate the proposed architecture, we conducted trace-driven simulations and measured the area and speed based on VLSI circuits.
international conference on asic | 2003
Seok-Won Heo; Moon-Gyung Kim; Yong-Surk Lee
In this paper, we propose a benchmark for optimized adder selection. Adders can divide data into small groups which are interconnected by carry propagate units. These adders were synthesized with a Samsung 0.35 μm, 3.3 V CMOS standard cell library using Design Compiler. The CLA in which the small groups were synthesized with ungrouping is the fastest of all the adders; it can operate at 289 MHz. The RCA in which all small groups were synthesized with grouping is the smallest of all the adders; it has about 748.6 gates. The optimized adder for a crypto-processor is a 64-bit RCA based on 16-bit CLAs. All small adder groups in this adder were synthesized with grouping. The adder can operate at a clock speed of 198.0 MHz and has about 966.6 gates. All adders can operate under the worst-case conditions of 2.7 V and 85 °C.
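The structure of the selected adder, 16-bit carry-lookahead groups chained by a rippling carry, can be sketched as a behavioral model. This is an assumption-laden illustration rather than the synthesized design: the names `cla_block` and `add64` are hypothetical, and the software loop only evaluates the generate/propagate carry recurrence that lookahead hardware would flatten into parallel logic.

```python
def cla_block(a, b, carry_in, width=16):
    """Model one carry-lookahead group; returns (sum, carry_out).

    Uses generate (g = a AND b) and propagate (p = a XOR b) signals.
    Hardware computes all carries in parallel from g, p and carry_in;
    this model evaluates the same recurrence c[i+1] = g[i] | (p[i] & c[i])
    bit by bit.
    """
    g = a & b
    p = a ^ b
    carry = carry_in
    total = 0
    for i in range(width):
        gi = (g >> i) & 1
        pi = (p >> i) & 1
        total |= (pi ^ carry) << i
        carry = gi | (pi & carry)
    return total, carry


def add64(a, b, carry_in=0, block=16, width=64):
    """64-bit adder built from 16-bit groups with the carry rippling between groups."""
    mask = (1 << block) - 1
    result, carry = 0, carry_in
    for offset in range(0, width, block):
        s, carry = cla_block((a >> offset) & mask, (b >> offset) & mask, carry, block)
        result |= s << offset
    return result, carry
```

Only the carry between groups is serial in this arrangement, which is consistent with the abstract's trade-off between the speed of a full CLA and the small area of a pure RCA.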
Operative Techniques in General Surgery | 2002
Seungchul Kim; Yong-Joo Lee; Wookyeong Jeong; Yong-Surk Lee
In this paper, we present the FP-AU (Floating Point Arithmetic Unit) that improves the performance of multimedia processing with low hardware cost. The hardware cost was minimized by designing a 32-bit single precision architecture that can execute 64-bit double precision operations with very low hardware overhead, such as a barrel shifter and control logic. Since the FP-AU occupies a significant amount of silicon area in a microprocessor due to double precision data, our proposed architecture shows a very efficient performance/cost ratio. The sticky bit generation logic offers a simple architecture that can reduce hardware cost. The FP-AU was modeled in Verilog HDL and synthesized with 0.35 μm standard cell libraries after verification. The occupied area is about 6,590 equivalent gates. It operates at 130 MHz clock speed under worst-case conditions.
international conference on vlsi and cad | 1999
Wookyeong Jeong; Sangjun An; Moon-Gyung Kim; Sangkyong Heo; Youngjun Kim; Sangook Moon; Yong-Surk Lee
In this paper, a combined architecture, YS-RDSP, which merges a RISC microprocessor with a DSP processor to suit embedded applications, is proposed and designed. The YS-RDSP can execute up to 4 instructions in parallel. In order to reduce the size of programs, the YS-RDSP has a variable instruction length of 16 or 32 bits. The YS-RDSP provides DSP processing power as well as the control power and programmability of a RISC microprocessor on a single chip. The YS-RDSP has 8 kbytes of ROM and 8 kbytes of RAM on chip. The system controller, a peripheral included in the chip, provides three power-down modes for low-power operation, and the SLEEP instruction switches the operation states of the CPU core and peripherals. The YS-RDSP processor is modeled in Verilog HDL with a top-down design methodology. The verified model is synthesized with a 0.6 μm, 3.3 V CMOS standard cell library and laid out using automated P&R, resulting in a 10.7 mm by 8.4 mm core area.
international conference on consumer electronics | 2005
In-Pyo Hong; Yong-Joo Lee; Sung-Jae Chun; Yong-Surk Lee; Jinoo Joung
This work suggests a multi-threading processor architecture for a wireless LAN MAC protocol controller. By adopting a multi-threading technique, not only can overall throughput be increased but also context switching time can be eliminated. In the wireless communication environment, wherein tiny events must be handled simultaneously, the multi-threading architecture can be effective. The proposed architecture is designed to be simple so that its design complexity is not significantly increased compared with conventional embedded processors. Thus, it can be used as a MAC layer controller, targeting mobile wireless LAN products.
The Journal of Korean Institute of Communications and Information Sciences | 2011
Jaewon Park; Won-Young Chung; Hyun-Pil Kim; Jung-Hee Lee; Yong-Surk Lee
The amount of network traffic from the Internet is increasing because of the use of Broadband Convergence Networks (BcN). Network traffic is also increasing because of the development of applications, especially multimedia traffic from IPTV, VOD, and online games. This multimedia traffic not only has a huge payload but also should be considered a threat in real time. For this reason, this study examines the ways that routers distribute bandwidth according to traffic properties. To classify the properties of the traffic, it is essential to analyze the application layer. However, the general network processor architecture processes the L2-4 and L7 layers serially. We propose a novel parallel network processor architecture with a global cache that processes L2-4 and L7 in parallel. To verify the proposed architecture, we simulated both architectures with SystemC. EEMBC and SNORT were used to measure the L2-4 and L7 processing times. When multimedia traffic entered the network processor in the same flow, the proposed architecture showed about 85% higher performance than the general architecture.
IEICE Electronics Express | 2009
Ha-young Jeong; Won Hur; Yong-Surk Lee
In this paper, we propose a scalable distributed memory system with a low-cost hardware message-passing interface (MPI). The proposed interface improves the communication performance between nodes by decreasing the synchronization overhead with a receiver reservation technique. The simulation results indicate that the performance is increased by 20% on 4x4 communications. The synthesis result of the proposed MPI indicates that its area is only 4.49% of each computing node. As a result, the proposed system is a useful embedded MPSoC (Multiprocessor System on a Chip) because of its low-cost implementation and scalability.
international conference on asic | 2003
Chang-Yong Heo; Kyu-Baik Choi; In-Pyo Hong; Yong-Surk Lee
An SMT architecture uses TLP (Thread Level Parallelism) and increases processor throughput, such that issue slots can be filled with instructions from multiple independent threads. Having multiple ready threads reduces the probability that a functional unit is left idle, which increases processor efficiency. To utilize those advantages in the SMT processor, the issue unit must control the flow of instructions from different threads without creating conflicts among those instructions, which makes the SMT issue logic extremely complex. Therefore, our SMT architecture, which is modeled in this paper, uses an in-order issue and completion scheme. An SMT architecture with an in-order issue scheme can use a simple issue mechanism with a scoreboard array instead of register renaming or a reorder buffer. However, an SMT scoreboarding mechanism is still more complex and costlier than that of a single-threaded conventional processor. This paper presents a simple but effective implementation of a scoreboard array for the ARM-based SMT processor.
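The general idea of a per-thread scoreboard array for in-order issue can be sketched as below. This is a generic illustration under stated assumptions, not the paper's implementation: the class `Scoreboard`, the function `select_thread`, the round-robin selection policy, and the choice of 16 registers per thread are all hypothetical.

```python
class Scoreboard:
    """One busy bit per (thread, register); a set bit marks a pending write."""

    def __init__(self, num_threads, num_regs=16):
        # num_regs=16 is an assumption for illustration.
        self.busy = [[False] * num_regs for _ in range(num_threads)]

    def can_issue(self, thread, srcs, dst):
        """True if no source is pending (RAW) and the destination is free (WAW)."""
        row = self.busy[thread]
        if any(row[r] for r in srcs):
            return False
        return dst is None or not row[dst]

    def issue(self, thread, dst):
        """Mark the destination register as pending when the instruction issues."""
        if dst is not None:
            self.busy[thread][dst] = True

    def complete(self, thread, dst):
        """Clear the pending bit when the result is written back."""
        if dst is not None:
            self.busy[thread][dst] = False


def select_thread(scoreboard, heads, start=0):
    """Pick the first thread, round-robin from `start`, whose oldest instruction can issue.

    heads[t] is a (srcs, dst) tuple for thread t's oldest instruction, or None.
    Returns the thread index, or None if every thread is stalled.
    """
    n = len(heads)
    for offset in range(n):
        t = (start + offset) % n
        if heads[t] is not None and scoreboard.can_issue(t, *heads[t]):
            return t
    return None
```

Because each thread has its own row, a stalled instruction in one thread does not block another thread from filling the issue slot, which is the throughput benefit the abstract describes.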