Publication


Featured research published by Daeseob Lim.


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005

Adaptive execution techniques for SMT multiprocessor architectures

Changhee Jung; Daeseob Lim; Jaejin Lee; SangYong Han

In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial, because of interference between threads and parallel execution overhead. To maximize performance on an SMT multiprocessor, it is important to find the optimal number of threads. This paper presents adaptive execution techniques that find the optimal execution mode for SMT multiprocessor architectures. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the application. When 10 standard numerical applications are run with our techniques on an Intel 4-processor Hyper-Threading Xeon SMP with 8 logical processors, our code is, on average, about 2 and 18 times faster than the original code executed on 4 and 8 logical processors, respectively.
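
The loop-level selection can be pictured as a small run-time search: time the loop under a few candidate thread counts and then keep the fastest configuration. The following is a minimal sketch in C with OpenMP under that assumption; the loop body run_loop and the candidate set are illustrative only, not the code the paper's preprocessor actually emits.

```c
/* Minimal sketch of per-loop adaptive thread selection, assuming OpenMP
 * and a hypothetical loop body run_loop(); the paper's preprocessor
 * generates equivalent instrumentation automatically. */
#include <omp.h>
#include <stdio.h>

#define N (1 << 20)
static double a[N], b[N];

/* The parallel loop whose thread count we want to tune. */
static void run_loop(int nthreads) {
    #pragma omp parallel for num_threads(nthreads)
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i] + a[i];
}

int main(void) {
    int max_threads = omp_get_max_threads();
    int best = 1;
    double best_time = 1e30;

    /* Dynamic feedback: time the loop once per candidate thread count
     * (1, 2, 4, ...) and remember the fastest configuration. */
    for (int t = 1; t <= max_threads; t *= 2) {
        double start = omp_get_wtime();
        run_loop(t);
        double elapsed = omp_get_wtime() - start;
        if (elapsed < best_time) { best_time = elapsed; best = t; }
    }

    /* Subsequent invocations of this loop use the chosen execution mode. */
    for (int iter = 0; iter < 100; iter++)
        run_loop(best);

    printf("selected %d thread(s)\n", best);
    return 0;
}
```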


International Parallel and Distributed Processing Symposium | 2006

Helper thread prefetching for loosely-coupled multiprocessor systems

Changhee Jung; Daeseob Lim; Jaejin Lee; Yan Solihin

This paper presents a helper thread prefetching scheme that is designed to work on loosely-coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely-coupled processors have an advantage in that fine-grain resources, such as processor and L1 cache resources, are not contended for by the application and helper threads, which preserves the speed of the application. However, inter-processor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely-coupled system can be done effectively, we evaluate our prefetching in a standard, unmodified CMP system and in an intelligent memory system where a simple processor in memory executes the helper thread. Evaluated with nine memory-intensive applications, our scheme achieves an average speedup of 1.25 with the memory processor in DRAM. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33.
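
The run-ahead control can be sketched with a shared progress counter that the application thread publishes and the helper thread polls, stalling whenever it gets more than a fixed distance ahead. The C11/pthreads sketch below assumes a simple array traversal and a hypothetical MAX_AHEAD constant; it illustrates the synchronization idea only, not the paper's actual mechanism.

```c
/* Minimal sketch of distance-controlled helper-thread prefetching,
 * assuming a shared progress counter; MAX_AHEAD and the array traversal
 * are illustrative, not the paper's implementation. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sched.h>

#define N 1000000
#define MAX_AHEAD 64                 /* how far the helper may run ahead */

static double data[N];
static _Atomic long app_pos = 0;     /* application thread's progress */

static void *helper(void *arg) {
    (void)arg;
    for (long i = 0; i < N; i++) {
        /* Throttle: wait while we are too far ahead of the application,
         * which keeps prefetches timely and limits cache pollution. */
        while (i - atomic_load(&app_pos) > MAX_AHEAD)
            sched_yield();
        __builtin_prefetch(&data[i], 0, 1);   /* warm the cache */
    }
    return NULL;
}

int main(void) {
    pthread_t tid;
    pthread_create(&tid, NULL, helper, NULL);

    double sum = 0.0;
    for (long i = 0; i < N; i++) {            /* application thread */
        sum += data[i];
        atomic_store(&app_pos, i);            /* publish progress */
    }
    pthread_join(tid, NULL);
    printf("sum = %f\n", sum);
    return 0;
}
```

Compiled with a pthread-enabled C compiler (for example, gcc -pthread), the helper never drifts more than MAX_AHEAD iterations ahead of the consumer.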


IEEE Transactions on Parallel and Distributed Systems | 2009

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Jaejin Lee; Changhee Jung; Daeseob Lim; Yan Solihin

This paper presents a helper thread prefetching scheme that is designed to work on loosely coupled processors, such as in a standard chip multiprocessor (CMP) system or an intelligent memory system. Loosely coupled processors have an advantage in that resources such as processor and L1 cache resources are not contended for by the application and helper threads, which preserves the speed of the application. However, interprocessor communication is expensive in such a system. We present techniques to alleviate this. Our approach exploits large loop-based code regions and is based on a new synchronization mechanism between the application and helper threads. This mechanism precisely controls how far ahead the execution of the helper thread can be with respect to the application thread. We found that this is important in ensuring prefetching timeliness and avoiding cache pollution. To demonstrate that prefetching in a loosely coupled system can be done effectively, we evaluate our prefetching by simulating a standard unmodified CMP system and an intelligent memory system where a simple processor in memory executes the helper thread. Evaluated with nine memory-intensive applications, our scheme achieves an average speedup of 1.25 with the memory processor in DRAM. Moreover, our scheme works well in combination with a conventional processor-side sequential L1 prefetcher, resulting in an average speedup of 1.31. In a standard CMP, the scheme achieves an average speedup of 1.33. Using a real CMP system with a shared L2 cache between two cores, our helper thread prefetching combined with hardware L2 prefetching achieves an average speedup of 1.15 over hardware L2 prefetching alone for the subset of applications with high L2 cache misses per cycle.


Journal of Parallel and Distributed Computing | 2010

Adaptive execution techniques of parallel programs for multiprocessors

Jaejin Lee; Jungho Park; Hong-Gyu Kim; Changhee Jung; Daeseob Lim; SangYong Han

In simultaneous multithreading (SMT) multiprocessors, using all the available threads (logical processors) to run a parallel loop is not always beneficial due to the interference between threads and parallel execution overhead. To maximize the performance of a parallel loop on an SMT multiprocessor, it is important to find an appropriate number of threads for executing the parallel loop. This article presents adaptive execution techniques that find a proper execution mode for each parallel loop in a conventional loop-level parallel program on SMT multiprocessors. A compiler preprocessor generates code that, based on dynamic feedback, automatically determines at run time the optimal number of threads for each parallel loop in the parallel application. We evaluate our technique by running a set of standard numerical applications on a real SMT multiprocessor machine with 8 hardware contexts. Our approach is general enough to work well with other SMT multiprocessor or multicore systems.


International Symposium on Multimedia | 2006

Scheduling Data Delivery in Heterogeneous Wireless Sensor Networks

Daeseob Lim; Jaewook Shim; Tajana Simunic Rosing; Tara Javidi

In this paper, we present a proxy-level scheduler that can significantly improve QoS in heterogeneous wireless sensor networks while reducing the overall power consumption. Our scheduler is transparent to both the applications and the MAC layer in order to take advantage of standard off-the-shelf components. The proposed scheduling reduces collisions through a generalized TDMA implementation, and thus improves throughput and QoS, by activating only a subset of stations at a time. Power savings are achieved by scheduling the transfer of larger bursts of IP packets followed by longer idle periods during which a node's radio can either enter sleep or be turned off. Our simulation and measurement results show significant power savings along with an improvement in QoS. On average, we obtain an 18% improvement in saturation throughput for real traffic and a 79% reduction in power in a highly loaded network.
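
The core idea, one active station per slot transmitting a buffered burst while the remaining stations sleep, can be shown with a tiny frame simulation; the station count, slot count, and arrival pattern below are assumptions made for illustration, not the paper's parameters.

```c
/* Minimal sketch of the burst-oriented, TDMA-like scheduling idea:
 * buffer packets per station and flush one station's whole burst per
 * slot so the other stations can sleep. All constants are assumptions. */
#include <stdio.h>

#define NUM_STATIONS 4
#define NUM_SLOTS    12

static int queue[NUM_STATIONS];   /* buffered packets per station */

int main(void) {
    for (int slot = 0; slot < NUM_SLOTS; slot++) {
        /* New packets arrive at every station while they buffer. */
        for (int s = 0; s < NUM_STATIONS; s++)
            queue[s] += 1 + (slot + s) % 3;

        /* Only one station is active per slot (generalized TDMA); it
         * transmits its whole burst, then idles until its next turn. */
        int active = slot % NUM_STATIONS;
        printf("slot %2d: station %d sends burst of %d packets, "
               "others sleep\n", slot, active, queue[active]);
        queue[active] = 0;
    }
    return 0;
}
```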


Journal of Low Power Electronics | 2011

Resource management in heterogeneous wireless sensor networks

Edoardo Regini; Daeseob Lim; Tajana Simunic Rosing

Heterogeneous wireless sensor networks such as the High Performance Wireless Research and Education Network (HPWREN) have environmental sensors located in remote, hard-to-reach locations far from the main high-bandwidth data links. The sensed data needs to be routed through multiple hops before reaching the backbone. The routing is done by battery-powered nodes using license-free radios such as 802.11. Minimizing energy consumption is critical to maintaining operational data links. This paper presents a solution that includes scheduling and routing algorithms and achieves up to 60% energy savings per battery-operated node with 20% lower latency when compared to existing techniques. Our TDMA-based scheduling algorithm limits the number of active nodes and allows a large portion of the nodes to sleep, thus saving energy. Since the algorithm is completely distributed and requires minimal control-packet exchange (only at join time), nodes in the sleep state can switch off their wireless network interfaces, further minimizing power consumption. Furthermore, the results show that limiting the number of active nodes decreases contention in the channel and hence increases aggregate throughput by up to 10%. Scheduling is combined with the dynamic creation of a backbone of nodes in charge of providing connectivity to the network and delivering data to the proper destinations. This mechanism sits on top of the unmodified MAC layer so that legacy network devices can be used and expensive hardware/software modifications are avoided.
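
The backbone construction can be approximated as choosing a small set of always-on relays so that every node is either a relay or one hop from one. The greedy dominating-set sketch below, on an assumed toy topology, illustrates that idea only; the paper's distributed, join-time mechanism differs.

```c
/* Minimal sketch of the backbone idea: greedily pick relay nodes so every
 * node is either a relay or adjacent to one (a dominating set), letting
 * the remaining nodes sleep. Topology and greedy rule are illustrative. */
#include <stdio.h>

#define N 6
/* Example adjacency matrix for a small multi-hop topology (assumption). */
static const int adj[N][N] = {
    {0,1,0,0,0,0},
    {1,0,1,1,0,0},
    {0,1,0,0,1,0},
    {0,1,0,0,1,0},
    {0,0,1,1,0,1},
    {0,0,0,0,1,0},
};

int main(void) {
    int covered[N] = {0}, backbone[N] = {0}, left = N;

    while (left > 0) {
        /* Pick the node that would cover the most uncovered nodes. */
        int best = -1, best_gain = -1;
        for (int v = 0; v < N; v++) {
            if (backbone[v]) continue;
            int gain = !covered[v];
            for (int u = 0; u < N; u++)
                if (adj[v][u] && !covered[u]) gain++;
            if (gain > best_gain) { best_gain = gain; best = v; }
        }
        backbone[best] = 1;
        if (!covered[best]) { covered[best] = 1; left--; }
        for (int u = 0; u < N; u++)
            if (adj[best][u] && !covered[u]) { covered[u] = 1; left--; }
    }

    printf("backbone (stays awake):");
    for (int v = 0; v < N; v++)
        if (backbone[v]) printf(" %d", v);
    printf("\nall other nodes may sleep between their TDMA slots\n");
    return 0;
}
```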


Archive | 2008

Scheduling above MAC to Maximize Battery Lifetime and Throughput in WLANs

Edoardo Regini; Daeseob Lim; Tajana Simunic Rosing


i-Perception | 2012

P2-16: Dual-Bound Model and the Role of Time Bound in Perceptual Decision Making

Daeseob Lim; Hansem Sohn; Sang-Hun Lee


i-Perception | 2011

Roles of Time Hazard in Perceptual Decision Making under High Time Pressure

Minju Kim; Yumin Suh; Daeseob Lim; Issac Rhim; Kyoung-Whan Choi; Sang-Hun Lee


Archive | 2009

Prefetching with Helper Threads for Loosely Coupled Multiprocessor Systems

Jaejin Lee; Changhee Jung; Daeseob Lim; Yan Solihin

Collaboration


Dive into Daeseob Lim's collaborations.

Top Co-Authors

Jaejin Lee
Seoul National University

Yan Solihin
North Carolina State University

Sang-Hun Lee
Seoul National University

SangYong Han
Seoul National University

Edoardo Regini
University of California

Hong-Gyu Kim
Seoul National University

Issac Rhim
Seoul National University

Jungho Park
Seoul National University