Soojung Ryu | Researchain

Publication

Featured research published by Soojung Ryu.

high-performance computer architecture | 2014

Improving GPGPU resource utilization through alternative thread block scheduling

Minseok Lee; Seokwoo Song; Joosik Moon; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu

High performance in GPGPU workloads is obtained by maximizing parallelism and fully utilizing the available resources. Thousands of threads are assigned to each core in units of CTAs (Cooperative Thread Arrays) or thread blocks, with each thread block consisting of multiple warps or wavefronts. The scheduling of the threads can have a significant impact on overall performance. In this work, we explore alternative thread block or CTA scheduling; in particular, we exploit the interaction between the thread block scheduler and the warp scheduler to improve performance. We explore two aspects of thread block scheduling: (1) LCS (lazy CTA scheduling), which restricts the maximum number of thread blocks allocated to each core, and (2) BCS (block CTA scheduling), where consecutive thread blocks are assigned to the same core. For LCS, we leverage a greedy warp scheduler to help determine the optimal number of thread blocks by only measuring the number of instructions issued, while for BCS, we propose an alternative warp scheduler that is aware of the “block” of CTAs allocated to a core. With LCS and the observation that the maximum number of CTAs does not necessarily maximize performance, we also propose mixed concurrent kernel execution that enables multiple kernels to be allocated to the same core to maximize resource utilization and improve overall performance.
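
The difference between spreading thread blocks across cores and the "consecutive thread blocks on the same core" idea behind BCS can be illustrated with a small static mapping. This is only a sketch under stated assumptions: the round-robin baseline, the function names, and the block size of 2 are hypothetical choices for illustration and are not taken from the paper, and a real scheduler assigns CTAs dynamically as cores free up resources.

```python
# Illustrative sketch only: static CTA-to-core mappings.
# Function names, the round-robin baseline, and block_size=2 are assumptions.

def round_robin_assignment(num_ctas: int, num_cores: int) -> list[int]:
    """Assumed baseline: CTA i is placed on core i mod num_cores."""
    return [cta % num_cores for cta in range(num_ctas)]


def block_assignment(num_ctas: int, num_cores: int, block_size: int = 2) -> list[int]:
    """BCS-style mapping: each run of block_size consecutive CTAs shares a core."""
    return [(cta // block_size) % num_cores for cta in range(num_ctas)]


if __name__ == "__main__":
    # Index = CTA id, value = core id.
    print(round_robin_assignment(8, 4))  # [0, 1, 2, 3, 0, 1, 2, 3]
    print(block_assignment(8, 4))        # [0, 0, 1, 1, 2, 2, 3, 3]
```

In the second mapping, CTAs 0 and 1 land on the same core, which is the property a block-aware warp scheduler could then exploit.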

field-programmable technology | 2012

ULP-SRP: Ultra Low-Power Samsung Reconfigurable Processor for Biomedical Applications

Changmoo Kim; Moo-Kyoung Chung; Yeongon Cho; Mario Konijnenburg; Soojung Ryu; Jeongwook Kim

The latest biomedical applications require low energy consumption, high performance, and wide energy-performance scalability to adapt to various working environments. This paper presents ULP-SRP, an energy-efficient reconfigurable processor for biomedical applications. ULP-SRP uses a Coarse Grained Reconfigurable Array (CGRA) for high-performance data processing with low energy consumption. For scalability, we propose three performance modes and a Unified Memory Architecture (UMA). Energy optimization is accomplished by run-time mode switching along with automatic power gating. Experimental results show that ULP-SRP achieves a 46.1% energy reduction compared to previous works.

international solid-state circuits conference | 2013

Reliable and energy-efficient 1MHz 0.4V dynamically reconfigurable SoC for ExG applications in 40nm LP CMOS

Mario Konijnenburg; Yeongon Cho; Maryam Ashouei; Tobias Gemmeke; Changmoo Kim; Jos Hulzink; Jan Stuyt; Mookyung Jung; Jos Huisken; Soojung Ryu; Jung-Wook Kim; H. de Groot

Wireless Sensor Nodes (WSNs) have a wide range of applications in health care and lifestyle monitoring. Their severe energy constraint is often addressed by minimizing the amount of transmitted data through energy-efficient on-node signal processing. The rationale for this approach is that a large portion of WSN energy is consumed by radio communication, even in very low-data-rate situations [1]. Efficient on-node processing has been the subject of recent work, with the common element being aggressive voltage scaling into the sub-threshold region [2-4]. A major assumption of the existing works is that the amount of required computation is low, justifying an on-node processor with limited computational capability. While this might be the case for many applications of WSNs, emerging ambulatory biomedical signal processing applications exceed the performance offered by today's on-node processors.

field-programmable technology | 2012

Design space exploration and implementation of a high performance and low area Coarse Grained Reconfigurable Processor

Dong-kwan Suh; Ki-seok Kwon; Suk-Jin Kim; Soojung Ryu; Jeongwook Kim

Coarse Grained Reconfigurable Architectures (CGRAs) have played a key role in the area of domain-specific processors due to their programmability and runtime reconfigurability. The Coarse Grained Array (CGA) structure enables target designs to achieve high performance, but it is easy to fall into over-design in terms of area. Moreover, the network overhead between the function units (FUs) seriously degrades clock speed. In this paper, we propose a high-performance CGRA that facilitates design space exploration (DSE) to reduce these overheads. It employs a concept of building blocks, named mini cores, to mitigate the overhead involved in DSE that aims to achieve high clock speed and small area in the target design. The proposed approach reduces the design time by more than 100 times compared with the previous design. Experimental results show that the implemented architecture reduces logic area by 14.38% and improves clock frequency by 59.34% without performance loss.

asia and south pacific design automation conference | 2013

Reevaluating the latency claims of 3D stacked memories

Daniel W. Chang; Gyung-Su Byun; Ho-Young Kim; Min-wook Ahn; Soojung Ryu; Nam Sung Kim; Michael J. Schulte

In recent years, 3D technology has been a popular area of study that has allowed researchers to explore a number of novel computer architectures. One of the more popular topics is that of integrating 3D main memory dies below the computing die and connecting them with through-silicon vias (TSVs). This is assumed to reduce off-chip main memory access latencies by roughly 45% to 60%. Our detailed circuit-level models, however, demonstrate that this latency reduction from the TSVs is significantly less. In this paper, we present these models, compare 2D and 3D main memory latencies, and show that the reduction in latency from using 3D main memory is no more than 2.4 ns. We also show that although the wider I/O bus width enabled by using TSVs increases performance, it may do so with an increase in power consumption. Although TSVs consume less power per bit transfer than off-chip metal interconnects (11.2 times less power per bit transfer), TSVs typically use considerably more bits and may result in a net increase in power due to the large number of bits in the memory I/O bus. Our analysis shows that although a 3D memory hierarchy exploiting a wider memory bus can increase performance, this performance increase may not justify the net increase in power consumption.
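
The power trade-off described above can be made concrete with a back-of-the-envelope calculation. The sketch below assumes a simple linear model in which I/O power scales with the number of bus bits and every line transfers at the same rate as in the off-chip bus; that model, the function name, and the example width ratios are assumptions for illustration. Only the 11.2x per-bit figure comes from the abstract.

```python
# Back-of-the-envelope sketch, not the paper's circuit-level model.
# Assumption: I/O power scales linearly with bus width at a fixed per-line rate.

TSV_PER_BIT_ADVANTAGE = 11.2  # per-bit power advantage of TSVs, from the abstract


def relative_io_power(bus_width_ratio: float,
                      per_bit_advantage: float = TSV_PER_BIT_ADVANTAGE) -> float:
    """Return 3D I/O power divided by 2D I/O power.

    bus_width_ratio is the (hypothetical) 3D bus width divided by the 2D bus width.
    A result above 1.0 means the wider TSV bus draws more power overall.
    """
    return bus_width_ratio / per_bit_advantage


if __name__ == "__main__":
    for ratio in (8, 11.2, 16):  # hypothetical width ratios
        print(f"{ratio}x wider bus -> {relative_io_power(ratio):.2f}x I/O power")
```

Under this simplified model the break-even point is a bus 11.2 times wider than the off-chip one; any wider and the per-bit saving is outweighed by the bit count.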

international conference on computer design | 2012

Providing cost-effective on-chip network bandwidth in GPGPUs

Hanjoon Kim; John Kim; Woong Seo; Yeongon Cho; Soojung Ryu

Network-on-chip (NoC) bandwidth has a significant impact on overall performance in throughput-oriented processors such as GPGPUs. Although it has been commonly assumed that high NoC bandwidth can be provided through abundant on-chip wires, we show that increasing NoC router frequency results in a more cost-effective NoC. However, the router arbitration critical path can limit the NoC router frequency. Thus, we propose a direct all-to-all network overlaid on mesh (DA2mesh) NoC architecture that exploits the traffic characteristics of GPGPUs and removes arbitration from the router pipeline. DA2mesh simplifies the router pipeline, improving performance by 36% while reducing NoC energy by 15%.

international symposium on circuits and systems | 2014

SimParallel: A high performance parallel SystemC simulator using hierarchical multi-threading

Moo-Kyoung Chung; Jun-Kyoung Kim; Soojung Ryu

As system complexity increases, simulation performance becomes one of the most important issues in virtual prototyping. Parallel simulation is an attractive technique for high-speed simulation utilizing state-of-the-art multi-core processors on a host workstation, but the efficiency of parallel simulation is low because of synchronization and communication overhead and unbalanced workloads among the host cores. This paper proposes a novel technique, hierarchical multi-threading, for the efficient parallel simulation of SystemC models, where the host cores can be maximally utilized with the same number of thread groups. We also present an efficient synchronization and dynamic load balancing scheme for the proposed parallel simulation. Experimental results show that the proposed method achieves a speed-up of 2.9 to 3.3 on a quad-core host workstation.
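
To put the reported speed-up in context, parallel efficiency is conventionally defined as speed-up divided by the number of cores. The short sketch below applies that standard definition to the figures quoted in the abstract; the function name is a hypothetical helper, and the derived efficiency values are not stated in the paper.

```python
# Standard definition of parallel efficiency applied to the quoted speed-ups.
# The helper name is an assumption; the efficiencies are derived, not reported.

def parallel_efficiency(speedup: float, num_cores: int) -> float:
    """Fraction of ideal linear scaling achieved: speed-up / core count."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    return speedup / num_cores


if __name__ == "__main__":
    for s in (2.9, 3.3):  # speed-up range from the abstract, quad-core host
        print(f"speed-up {s} on 4 cores -> efficiency {parallel_efficiency(s, 4):.1%}")
```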

international conference on computer graphics and interactive techniques | 2013

Real-time ray tracing on future mobile computing platform

Won-Jong Lee; Youngsam Shin; Jae Don Lee; Shihwa Lee; Soojung Ryu; Jeongwook Kim

In this work, we present a novel mobile computing platform for mobile ray tracing in which a fast, compact hardware accelerator and a flexible programmable shader are combined. Our platform has two key features: 1) an area-efficient parallel pipelined traversal unit; and 2) flexible and high-performance kernels for shading and ray generation. Simulation results show that our platform is potentially a versatile graphics solution for future application processors, as it provides real-time ray tracing performance at full HD resolution that can compete with that of existing desktop GPU ray tracers. Our system is implemented on an FPGA platform, and mobile ray tracing is successfully demonstrated.

field-programmable technology | 2013

Real-time ray tracing on coarse-grained reconfigurable processor

Jaedon Lee; Youngsam Shin; Won-Jong Lee; Soojung Ryu; Jeongwook Kim

Ray tracing is a 3D rendering method for generating an image by simulating the path of light. It can generate high-quality images, but it requires great computing power. Recent advances in ray tracing technology enable real-time ray tracing on modern desktop CPUs/GPUs. In the current mobile environment, however, it is difficult because of inadequate computing power, memory bandwidth, and flexibility in mobile GPUs. In this paper, we present a mobile ray tracing system using the Samsung Reconfigurable Processor (SRP). The SRP architecture includes a tightly coupled very long instruction word (VLIW) engine and a coarse-grained reconfigurable array (CGRA). The VLIW engine is designed for general-purpose computations, such as function invocation and branch selection, and the coarse-grained reconfigurable array is specialized for the data-intensive parts of the program and can be configured dynamically. We propose an iterative batch-based ray tracing algorithm for SRP and optimize memory bandwidth with local memory and a data cache. Our ray tracing system is implemented on a commercial FPGA-based prototyping system. The experimental results show that our system is suitable for mobile ray tracing.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2013

Mapping and Scheduling of Tasks and Communications on Many-Core SoC Under Local Memory Constraint

Jinho Lee; Moo-Kyoung Chung; Yeongon Cho; Soojung Ryu; Jung Ho Ahn; Kiyoung Choi

There has been extensive research on mapping and scheduling tasks on a many-core SoC. However, none considers the optimization of communication types, which can significantly affect performance, energy consumption, and local memory usage of the SoC. This paper presents an approach to automatic mapping and scheduling of tasks and communications on a many-core SoC. The key idea is to decide the type of each communication, either message passing or shared memory, when we do the mapping and scheduling. By assigning a proper type to each communication, we can optimize the energy consumption, performance, or energy-delay product. To solve the optimization problem, the approach adopts a probabilistic algorithm coupled with some heuristics. To enhance throughput of the system, it performs software pipelined scheduling of the tasks using a modified iterative modulo scheduling technique. Experiments show that our algorithm achieves on average 50.1% lower energy consumption, 21.0% higher throughput, and 64.9% lower energy-delay product, compared to shared-memory-only communication.