Won Woo Ro | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Won Woo Ro is active.

Explore More

Publication

Featured researches published by Won Woo Ro.

international symposium on computer architecture | 2015

Warped-compression: enabling power efficient GPUs through register compression

Sangpil Lee; Keunsoo Kim; Gunjae Koo; Hyeran Jeon; Won Woo Ro; Murali Annavaram

This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar, namely the arithmetic differences between two successive thread registers is small. Removing data redundancy of register values through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge as they are necessary to keep concurrent execution contexts and to enable fast context switching. As a result register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread level parallelism. To reduce register file data redundancy warped-compression uses low-cost and implementationefficient base-delta-immediate (BDI) compression scheme, that takes advantage of banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing delta values of all the other registers, or banks. Warped-compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which leads to reduction in dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.

Journal of Communications and Networks | 2008

Efficient peer-to-peer file sharing using network coding in MANET

Uichin Lee; Joon-Sang Park; Seung-Hoon Lee; Won Woo Ro; Giovanni Pau; Mario Gerla

Mobile peer-to-peer (P2P) systems have recently got in the limelight of the research community that is striving to build efficient and effective mobile content addressable networks. Along this line of research, we propose a new peer-to-peer file sharing protocol suited to mobile ad hoc networks (MANET). The main ingredients of our protocol are network coding and mobility assisted data propagation, i.e., single-hop communication. We argue that network coding in combination with single-hop communication allows P2P file sharing systems in MANET to operate in a more efficient manner and helps the systems to deal with typical MANET issues such as dynamic topology and intermittent connectivity as well as various other issues that have been disregarded in previous MANET P2P researches such as addressing, node/user density, non-cooperativeness, and unreliable channel. Via simulation, we show that our P2P protocol based on network coding and single-hop communication allows shorter file downloading delays compared to an existing MANET P2P protocol.

IEEE Transactions on Parallel and Distributed Systems | 2010

On Improving Parallelized Network Coding with Dynamic Partitioning

Karam Park; Joon-Sang Park; Won Woo Ro

In this paper, we investigate parallel implementation techniques for network coding. It is known that network coding is useful for both wired and wireless networks and it also mitigates peer/piece selection problems in P2P file sharing systems. However, due to the decoding complexity of network coding, there have been concerns about adoption of network coding in practical network systems and to improve the decoding performance, the exploitation of parallelism has been proposed previously. In this paper, we argue that naive parallelization strategies of network coding may result in unbalanced workload distribution, and thus, limiting performance improvements. We further argue that a higher performance enhancement can be achieved through balanced partitioning methods in parallelized network coding and propose new parallelization techniques for network coding. Our experiments show that on a quad-core processor system, proposed algorithms exhibit up to 5.69 speedup which is better than the linear speedup with the influence of additional cache. Moreover, on an octal-core system, our algorithms even achieve speedup of 8.46 compared to a sequential network coding and 43.3 percent faster than an existing parallelized technique using 1 Mbytes data with 1,024 \times 1,024 coefficient matrix size.

The Computer Journal | 2012

Accelerated Network Coding with Dynamic Stream Decomposition on Graphics Processing Unit

Sangpil Lee; Won Woo Ro

Network coding, a well-known technique for optimizing data-flow in wired and wireless network systems, has attracted considerable attention in various fields. However, the decoding complexity in network coding becomes a major performance bottleneck in the practical network systems; thus, several researches have been conducted for improving the decoding performance in network coding. Nevertheless, previously proposed parallel network coding algorithms have shown limited scalability and performance imbalance for different-sized transfer units and multiple streams. In this paper, we propose a new parallel decoding algorithm for network coding using a graphics processing unit (GPU). This algorithm can simultaneously process multiple incoming streams and can maintain its maximum decoding performance irrespective of the size and number of transfer units. Our experimental results show that the proposed algorithm exhibits a 682.2 Mbps decoding bandwidth on a system with GeForce GTX 285 GPU and speed-ups of up to 26 as compared to the existing single stream decoding procedure with a 128 × 128 coefficient matrix and different-sized data blocks.

international symposium on computer architecture | 2016

Warped-slicer: efficient intra-SM slicing through dynamic resource partitioning for GPU multiprogramming

Qiumin Xu; Hyeran Jeon; Keunsoo Kim; Won Woo Ro; Murali Annavaram

As technology scales, GPUs are forecasted to incorporate an ever-increasing amount of computing resources to support thread-level parallelism. But even with the best effort, exposing massive thread-level parallelism from a single GPU kernel, particularly from general purpose applications, is going to be a difficult challenge. In some cases, even if there is sufficient thread-level parallelism in a kernel, there may not be enough available memory bandwidth to support such massive concurrent thread execution. Hence, GPU resources may be underutilized as more general purpose applications are ported to execute on GPUs. In this paper, we explore multiprogramming GPUs as a way to resolve the resource underutilization issue. There is a growing hardware support for multiprogramming on GPUs. Hyper-Q has been introduced in the Kepler architecture which enables multiple kernels to be invoked via tens of hardware queue streams. Spatial multitasking has been proposed to partition GPU resources across multiple kernels. But the partitioning is done at the coarse granularity of streaming multiprocessors (SMs) where each kernel is assigned to a subset of SMs. In this paper, we advocate for partitioning a single SM across multiple kernels, which we term as intra-SM slicing. We explore various intra-SM slicing strategies that slice resources within each SM to concurrently run multiple kernels on the SM. Our results show that there is not one intra-SM slicing strategy that derives the best performance for all application pairs. We propose Warped-Slicer, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance. The model relies on a set of short online profile runs to determine how each kernels performance varies as more thread blocks from each kernel are assigned to an SM. The model takes into account the interference effect of shared resource usage across multiple kernels. The model is also computationally efficient and can determine the resource partitioning quickly to enable dynamic decision making as new kernels enter the system. We demonstrate that the proposed Warped-Slicer approach improves performance by 23% over the baseline multiprogramming approach with minimal hardware overhead.

international conference on human computer interaction | 2012

Cooperative heterogeneous computing for parallel processing on CPU/GPU hybrids

Changmin Lee; Won Woo Ro; Jean-Luc Gaudiot

This paper presents a cooperative heterogeneous computing framework which enables the efficient utilization of available computing resources of host CPU cores for CUDA kernels, which are designed to run only on GPU. The proposed system exploits at runtime the coarse-grain thread-level parallelism across CPU and GPU, without any source recompilation. To this end, three features including a work distribution module, a transparent memory space, and a global scheduling queue are described in this paper. With a completely automatic runtime workload distribution, the proposed framework achieves speedups as high as 3.08 compared to the baseline GPU-only processing.

International Journal of Parallel Programming | 2014