Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Byeong Kil Lee is active.

Publication


Featured research published by Byeong Kil Lee.


International Conference on Computer Design | 2003

NpBench: a benchmark suite for control plane and data plane applications for network processors

Byeong Kil Lee; Lizy Kurian John

Modern network interfaces demand highly intelligent traffic management in addition to the basic requirement of wire-speed packet forwarding. Several vendors are releasing network processors in order to handle these demands. Network workloads can be classified into data plane and control plane workloads; however, most network processors are optimized for the data plane. Existing benchmark suites for network processors also primarily contain data plane workloads, which perform packet processing for forwarding functions. We present a set of benchmarks, called NpBench, targeted at control plane workloads (e.g., traffic management, quality of service) as well as data plane workloads. The characteristics of the NpBench workloads, such as instruction mix, parallelism, cache behavior, and required processing capability per packet, are presented and compared with CommBench, an existing network processor benchmark suite [T. Wolf et al., 2000]. We also discuss the architectural characteristics of the benchmarks with control plane functions, their implications for network processor design, and the significance of instruction-level parallelism (ILP) in network processors.


International Conference on Information Technology: New Generations | 2010

Towards Smaller-Sized Cache for Mobile Processors Using Shared Set-Associativity

Naveen Davanam; Byeong Kil Lee

As multi-core trends become dominant, cache structures grow more complicated and larger shared level-2 caches are demanded. Multi-core designs are also being applied to mobile processors. To achieve higher cache performance, lower power consumption, and smaller chip area in multi-core mobile processors, cache configurations should be re-organized and re-analyzed. Mobile Internet Devices (MIDs), which embed mobile processors, are becoming a major platform and are expected to run more general-purpose workloads on new form factors (e.g., netbooks). In this paper, we propose a novel cache mechanism that improves performance without increasing cache memory size. Most applications (workloads) exhibit spatial locality in their cache behavior, meaning that a small range of cache locations tends to be used within a given window of time. Viewing this locality in reverse, logically farthest sets have relatively low correlation in terms of locality; the probability that two such sets are used in the same basic block is very low. Based on this observation, we investigate the feasibility of sharing two sets of cache blocks for data fill and replacement within a cache. By sharing the sets, an acceptable performance improvement can be expected without increasing cache size. Based on our simulations with sampled SPEC CPU2000 workloads, the proposed cache mechanism reduces the cache miss rate by up to 8.5% on average (depending on cache size and baseline set associativity) compared to the baseline cache.
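To illustrate the shared-set idea, the sketch below (a minimal Python model, not the paper's implementation) pairs each cache set with a logically distant partner set and lets a fill spill into the partner when the home set is full; the set count, associativity, and pairing function are illustrative assumptions.

```python
class SharedSetCache:
    """Toy set-associative cache where each set may borrow space from its
    logically farthest partner set (a sketch, not the paper's design)."""

    def __init__(self, num_sets=64, assoc=4, block_size=64):
        self.num_sets = num_sets
        self.assoc = assoc
        self.block_size = block_size
        self.sets = [[] for _ in range(num_sets)]  # each set holds block ids in LRU order

    def partner(self, index):
        # Logically farthest set: offset by half the set count (assumed pairing).
        return (index + self.num_sets // 2) % self.num_sets

    def access(self, address):
        block = address // self.block_size
        index = block % self.num_sets
        for s in (index, self.partner(index)):
            if block in self.sets[s]:
                self.sets[s].remove(block)
                self.sets[s].append(block)   # refresh LRU position
                return True                  # hit
        # Miss: fill the home set if it has room, else try the partner,
        # else evict the LRU block of the home set.
        home = self.sets[index]
        spare = self.sets[self.partner(index)]
        if len(home) < self.assoc:
            home.append(block)
        elif len(spare) < self.assoc:
            spare.append(block)
        else:
            home.pop(0)
            home.append(block)
        return False
```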


International SoC Design Conference | 2010

Effective workload reduction for early-stage power estimation

Satish Raghunath; Byeong Kil Lee

In today's information technology trends, all kinds of digital and multimedia technologies are converging into single mobile Internet devices (MIDs). This digital convergence is being accelerated by deep sub-micron technology and the concept of system-on-chip (SoC) design. In SoC design, reconfigurable soft IPs are more widely used than hard IPs to obtain a design better optimized for a given system or platform. Deciding on an optimized configuration of the soft IPs requires early-stage design exploration, and choosing appropriate workloads and a workload-reduction methodology is crucial for accurate and fast estimation during that exploration. In this paper, we propose a methodology to reduce the amount of workload used for early-stage power estimation. We explore two scenarios for effective workload reduction: (i) instruction-distribution-based workload reduction and (ii) demand-based workload reduction. In our experiments, power estimation with the reduced workload is more accurate and faster than with a conventional reduction method (SimPoint). We conclude that workload-reduction techniques customized to the performance metric of interest are needed for effective and faster performance evaluation at each design stage, especially in SoC design.
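One plausible reading of instruction-distribution-based workload reduction is sketched below in Python: execution intervals are represented by their instruction-mix histograms, and an interval is kept only if its mix differs sufficiently from every representative already selected. The distance measure and threshold are assumptions, not the paper's exact procedure.

```python
def reduce_workload(intervals, threshold=0.1):
    """Pick representative intervals by instruction-mix similarity (sketch).

    `intervals` is a list of instruction-type histograms, e.g.
    [{'alu': 0.55, 'load': 0.25, 'store': 0.10, 'branch': 0.10}, ...].
    An interval is dropped when its mix lies within `threshold` (L1 distance)
    of an already-kept representative.
    """
    def dist(a, b):
        keys = set(a) | set(b)
        return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in keys)

    kept = []
    for mix in intervals:
        if all(dist(mix, rep) > threshold for rep in kept):
            kept.append(mix)
    return kept
```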


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2006

Architectural enhancements for network congestion control applications

Byeong Kil Lee; Lizy Kurian John; Eugene John

Complex network protocols and various network services require significant processing capability in modern network applications. One important feature of modern networks is differentiated service. Along with differentiated service, rapidly changing network environments lead to congestion problems. In this paper, we analyze the characteristics of representative congestion control applications, namely scheduling and queue management algorithms, and we propose application-specific acceleration techniques that exploit instruction-level parallelism (ILP) and packet-level parallelism (PLP) in these applications. From the PLP perspective, we propose a hardware acceleration model based on a detailed analysis of congestion control applications. To achieve high throughput, a large number of processing elements (PEs) and a parallel comparator are designed. Such hardware accelerators provide parallelism proportional to the number of processing elements added: a 32-PE enhancement yields a 24x speedup for weighted fair queueing (WFQ) and a 27x speedup for random early detection (RED). For ILP, new instruction set extensions for fast conditional operations are applied to the congestion control applications. In our experiments, the proposed instruction set extensions show a 10%-12% performance improvement. As the performance of general-purpose processors rapidly increases, defining architectural extensions for general-purpose processors (e.g., the multimedia extensions (MMX) used for multimedia applications) could be an alternative solution for a wide range of network applications.
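For context, the per-packet decision that RED makes (and that the proposed processing elements would accelerate) is sketched below in Python; the threshold and weight values are illustrative, not taken from the paper.

```python
import random

class REDQueue:
    """Textbook Random Early Detection drop decision (illustrative parameters)."""

    def __init__(self, min_th=5, max_th=15, max_p=0.1, weight=0.002):
        self.min_th, self.max_th, self.max_p = min_th, max_th, max_p
        self.weight = weight   # EWMA weight for the average queue length
        self.avg = 0.0

    def should_drop(self, current_queue_len):
        # Update the exponentially weighted average queue length.
        self.avg = (1 - self.weight) * self.avg + self.weight * current_queue_len
        if self.avg < self.min_th:
            return False                      # enqueue the packet
        if self.avg >= self.max_th:
            return True                       # drop the packet
        # Drop probability grows linearly between the two thresholds.
        p = self.max_p * (self.avg - self.min_th) / (self.max_th - self.min_th)
        return random.random() < p
```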


International Performance Computing and Communications Conference | 2011

CUDA acceleration of P7Viterbi algorithm in HMMER 3.0

Saddam Quirem; Fahian Ahmed; Byeong Kil Lee

The dynamic programming matrices and the P7Viterbi algorithm of HMMER 3.0 exhibit a high degree of parallelism: every query can have its score calculated independently, with one thread per query. In this paper, these parallel features are exploited using CUDA on a GPGPU. The CUDA implementation of the algorithm, run on a Tesla C1060, achieves a 10–15x speedup depending on the number of queries. Without concurrent kernel execution and memory transfers, a speedup of over 4x is achieved in total execution time. Because there is a wide range of data sizes for which the CPU performs better, it is important that CUDA-enabled programs properly select when to use, and when not to use, the GPU for acceleration.
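The real P7Viterbi recurrence tracks match, insert, and delete states; the simplified Python sketch below only illustrates why scoring is embarrassingly parallel across queries, mapping each independent query score to its own worker (analogous to one GPU thread per query). The toy model and its parameters are assumptions, not the HMMER code.

```python
from multiprocessing import Pool

def viterbi_score(query, states, trans, emit):
    """Simplified log-space Viterbi score for one query sequence.
    Each query's score depends only on that query, so many queries
    can be scored in parallel."""
    prev = {s: emit[s].get(query[0], float('-inf')) for s in states}
    for sym in query[1:]:
        cur = {}
        for s in states:
            cur[s] = max(prev[p] + trans[p][s] for p in states) \
                     + emit[s].get(sym, float('-inf'))
        prev = cur
    return max(prev.values())

def score_all(queries, states, trans, emit, workers=4):
    # Independent per-query work maps directly onto parallel workers.
    with Pool(workers) as pool:
        return pool.starmap(viterbi_score,
                            [(q, states, trans, emit) for q in queries])
```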


Application-Specific Systems, Architectures, and Processors | 2002

Implications of programmable general purpose processors for compression/encryption applications

Byeong Kil Lee; Lizy Kurian John

With the growth of the Internet and the mobile communication industry, multimedia applications form a dominant computer workload. Media workloads are typically executed on application-specific integrated circuits (ASICs), application-specific processors (ASPs), or general-purpose processors (GPPs). GPPs are flexible and accommodate changes in applications and algorithms better than ASICs and ASPs; however, executing these applications on GPPs comes at a high cost. In this paper, we analyze media compression/decompression algorithms from the perspective of the overhead of executing them on a programmable general-purpose processor versus ASPs. We choose nine encode/decode programs from audio, image/video, and encryption applications. The instruction mix, memory access behavior, and parallelism during the execution of these programs are analyzed. Memory access latency is observed to be the main factor influencing execution time on general-purpose processors. Most of these compression/decompression algorithms process the data through execution phases (e.g., quantization, encoding), and temporary results are stored and retrieved between these phases. A metric called overhead memory-access bandwidth per input/output byte is defined to characterize the temporary memory activity of each application. We observe that more than 90% of the memory accesses made by these programs are temporary data stores and loads arising from the general-purpose nature of the execution platform. We also study the data parallelism in these applications, indicating the ability of instruction-level and data-level parallel processors to exploit it. The parallelism ranges from 6 to 529 in the encode processes and from 18 to 558 in the decode processes.
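A plausible formulation of the overhead metric is sketched below; the exact definition used in the paper may differ, so the formula and the example numbers are assumptions for illustration only.

```python
def overhead_bandwidth_per_io_byte(load_bytes, store_bytes, input_bytes, output_bytes):
    """One plausible reading of the metric: memory traffic beyond what is
    needed to read the input and write the output, normalized per
    input/output byte."""
    essential = input_bytes + output_bytes
    total = load_bytes + store_bytes
    overhead = max(total - essential, 0)
    return overhead / essential

# Example: a codec that moves 50 MB through memory to turn a 4 MB input
# into a 1 MB output incurs (50 - 5) / 5 = 9 overhead bytes per I/O byte.
print(overhead_bandwidth_per_io_byte(30e6, 20e6, 4e6, 1e6))
```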


International Performance Computing and Communications Conference | 2012

Fixed Segmented LRU cache replacement scheme with selective caching

Kathlene Morales; Byeong Kil Lee

Cache replacement policies are an essential part of the memory hierarchy used to bridge the speed gap between the CPU and memory. Most cache replacement algorithms that perform significantly better than the LRU (Least Recently Used) policy come at the cost of large hardware requirements [1][3]. With the rise of mobile computing and system-on-chip technology, these hardware costs are not acceptable. The goal of this research is to design a low-cost cache replacement algorithm that achieves performance comparable to existing schemes. In this paper, we propose two enhancements to the SLRU (Segmented LRU) algorithm: (i) fixing the sizes of the protected and probationary segments at an effective segmentation ratio that enlarges the protected segment, and (ii) implementing selective caching, which achieves more effective eviction by preventing dead blocks from entering the cache. Our experimental results show a speedup of up to 14.0% over LRU and up to 12.5% over standard SLRU.
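A minimal Python sketch of SLRU with a fixed protected/probationary split and a selective-caching bypass is shown below; the segment ratio and the dead-block predictor interface are assumptions, not the paper's design.

```python
from collections import OrderedDict

class SLRUCache:
    """Segmented LRU with a fixed probationary/protected split and an
    optional bypass for blocks predicted dead (illustrative sketch)."""

    def __init__(self, capacity=8, protected_ratio=0.75):
        self.protected_cap = int(capacity * protected_ratio)
        self.probation_cap = capacity - self.protected_cap
        self.protected = OrderedDict()   # blocks touched at least twice
        self.probation = OrderedDict()   # blocks seen once so far

    def access(self, block, predicted_dead=False):
        if block in self.protected:
            self.protected.move_to_end(block)
            return True
        if block in self.probation:
            # Second touch: promote to the protected segment.
            del self.probation[block]
            if len(self.protected) >= self.protected_cap:
                victim, _ = self.protected.popitem(last=False)
                self._insert_probation(victim)   # demote the protected LRU block
            self.protected[block] = True
            return True
        # Miss. Selective caching: blocks predicted dead bypass the cache.
        if not predicted_dead:
            self._insert_probation(block)
        return False

    def _insert_probation(self, block):
        if len(self.probation) >= self.probation_cap:
            self.probation.popitem(last=False)   # evict the probationary LRU block
        self.probation[block] = True
```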


Modeling, Analysis, and Simulation of Computer and Telecommunication Systems | 2010

Composite Pseudo-Associative Cache for Mobile Processors

Lakshmi Deepika Bobbala; Javier Salvatierra; Byeong Kil Lee

Multi-core trends are becoming dominant, creating sophisticated and complicated cache structures, and larger shared level-2 (L2) caches are demanded for higher cache performance. One of the easiest ways to design a cache for increased performance is to double its size. However, a larger cache is directly related to greater area and power consumption; in mobile processors especially, simply increasing the cache size may significantly affect chip area and power. In this paper, we propose a composite cache mechanism for the L2 cache that maximizes cache performance within a given cache size. This technique can be used without increasing the cache size or set associativity by emphasizing primary-way utilization and pseudo-associativity. Based on our experiments with sampled SPEC CPU2000 workloads, the proposed cache mechanism shows a remarkable reduction in cache misses. The performance improvement varies with cache size and set associativity, but the proposed scheme is more sensitive to increases in cache size than to increases in set associativity.
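The pseudo-associative part of the idea can be sketched as follows in Python: on a miss in the primary location, one partner location is probed, and a hit there is swapped back into the primary slot. The index pairing and swap policy are illustrative assumptions, not the composite scheme itself.

```python
class PseudoAssociativeCache:
    """Direct-mapped cache with one extra probe in a partner set on a
    primary miss, in the spirit of pseudo-associativity (sketch only)."""

    def __init__(self, num_sets=128, block_size=64):
        self.num_sets = num_sets
        self.block_size = block_size
        self.tags = [None] * num_sets

    def _index(self, address):
        return (address // self.block_size) % self.num_sets

    def access(self, address):
        tag = address // self.block_size
        primary = self._index(address)
        secondary = primary ^ (self.num_sets >> 1)   # flip the top index bit
        if self.tags[primary] == tag:
            return 'primary hit'
        if self.tags[secondary] == tag:
            # Swap so the next access hits in the fast, primary location.
            self.tags[primary], self.tags[secondary] = \
                self.tags[secondary], self.tags[primary]
            return 'secondary hit'
        # Miss: displace the primary block into the secondary slot, then fill.
        self.tags[secondary] = self.tags[primary]
        self.tags[primary] = tag
        return 'miss'
```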


IEEE Transactions on Computers | 2005

Implications of executing compression and encryption applications on general purpose processors

Byeong Kil Lee; Lizy Kurian John

Compression and encryption applications are important components of modern multimedia workloads. These workloads are typically executed on application specific integrated circuits (ASICs), application specific processors (ASPs), or general purpose processors (GPPs). GPPs are flexible and allow changes in the applications and algorithms better than ASICs and ASPs. However, executing these applications on GPPs is done at a high cost. In this paper, we analyze media compression and encryption applications from the perspective of executing them on a programmable general purpose processor versus ASPs. We select 12 programs from various types of compression and encryption applications. The instruction mix, data types of memory operations, and memory access workloads during the execution of these programs are analyzed. Most of these applications involve processing the data through execution phases (e.g., quantization, encoding, etc.) and temporary results are stored and retrieved between these phases. A metric called overhead memory-access bandwidth per input and output byte is defined to characterize the temporary storage and retrieval of data during execution. We observe that more than 90 percent of the memory accesses made by these programs are temporary data stores and loads arising from the general purpose nature of the execution platform. We also verify the robustness of the proposed metric on different experimental environments using various input data properties and compiler optimizations.


International Conference on Information Technology: New Generations | 2011

Hybrid-way Cache for Mobile Processors

Bobbala Lakshmi Deepika; Byeong Kil Lee

As multi-core trends become dominant, cache structures are becoming sophisticated and complicated, and larger shared level-2 (L2) caches are demanded for higher cache performance. However, a larger cache is directly related to greater area and power consumption. When designing a cache memory, one of the easiest ways to increase performance is to double the cache size; in mobile processors, however, simply increasing the cache size may significantly affect chip area and power. To address this issue, we propose the hy-way (hybrid-way) cache, a composite cache mechanism that maximizes cache performance within a given cache size. This mechanism can improve cache performance without increasing the cache size or set associativity by emphasizing the utilization of the primary way(s) and pseudo-associativity. Based on our experiments with sampled SPEC CPU2000 workloads, the proposed cache mechanism shows a remarkable reduction in cache misses, at the cost of additional hardware and power consumption. The performance improvement varies with cache size and set associativity, but the proposed scheme is more sensitive to increases in cache size than to increases in set associativity.

Collaboration


Dive into Byeong Kil Lee's collaborations.

Top Co-Authors

Eugene John, University of Texas at San Antonio
Lizy Kurian John, University of Texas at Austin
Fahian Ahmed, University of Texas at San Antonio
Saddam Quirem, University of Texas at San Antonio
Satish Raghunath, University of Texas at San Antonio
Lakshmi Deepika Bobbala, University of Texas at San Antonio
Naveen Davanam, University of Texas at San Antonio
Jie Luo, University of Texas at San Antonio
Kathlene Morales, University of Texas at San Antonio
Savithra Eratne, University of Texas at San Antonio