Sunok Kim
Yonsei University
Publications
Featured research published by Sunok Kim.
international conference on consumer electronics | 1991
Sunok Kim; Young-Keun Park; Dae Hee Youn; Won Kim; ByungIn Yoo
The authors describe a bit rate reduction algorithm for use in digital VCRs, based on block scrambling and the two-dimensional adaptive discrete cosine transform. This method satisfies many of the demands unique to digital VCRs. Before the encoding process, the input image is divided into 8×8 pixel blocks and then scrambled on a block-by-block basis to achieve efficient bit distribution within a frame and to simplify the buffer control strategy. In addition to block scrambling, a simple yet efficient variable-length coding method is proposed. Simulation results are presented.
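The core pipeline of the abstract, block scrambling followed by a block-wise 2D DCT, can be illustrated with a short sketch. The following is a minimal, hypothetical illustration (not the paper's implementation); the function names and the random permutation are assumptions, and quantization and variable-length coding are omitted.

```python
# Minimal sketch (not the paper's code): block scrambling followed by a
# block-wise 2D DCT, the two front-end steps of the described encoder.
import numpy as np
from scipy.fft import dctn

def scramble_blocks(image, block=8, seed=0):
    """Split the image into block x block tiles and permute them
    pseudo-randomly, spreading complex and smooth blocks evenly across
    the frame to simplify buffer control."""
    h, w = image.shape
    blocks = (image.reshape(h // block, block, w // block, block)
                   .swapaxes(1, 2)
                   .reshape(-1, block, block))
    order = np.random.default_rng(seed).permutation(len(blocks))
    return blocks[order], order

def transform_blocks(blocks):
    """Apply a 2D DCT to each block; a full encoder would then quantize
    and variable-length code the coefficients."""
    return np.stack([dctn(b, norm='ortho') for b in blocks])

rng = np.random.default_rng(0)
image = rng.random((64, 64), dtype=np.float32)  # stand-in for one frame
blocks, order = scramble_blocks(image)
coeffs = transform_blocks(blocks)
```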
IEEE Transactions on Image Processing | 2017
Sunok Kim; Dongbo Min; Seungryong Kim; Kwanghoon Sohn
Confidence estimation is essential for refining stereo matching results through a post-processing step. This problem has recently been studied using learning-based approaches, which demonstrate a substantial improvement over conventional non-learning-based methods. However, the formulation of learning-based methods that estimate the confidence of each pixel individually disregards the spatial coherency that might exist in the confidence map, thus providing limited performance under challenging conditions. Our key observation is that the confidence features and resulting confidence maps vary smoothly in the spatial domain and are highly correlated within local regions of an image. We present a new approach that imposes spatial consistency on the confidence estimation. Specifically, a set of robust confidence features is extracted from each superpixel decomposed using a Gaussian mixture model, and these features are concatenated with pixel-level confidence features. The features are then enhanced through adaptive filtering in the feature domain. In addition, the resulting confidence map, estimated using the confidence features with a random regression forest, is further improved through a K-nearest-neighbor-based aggregation scheme at both the pixel and superpixel level. To validate the proposed confidence estimation scheme, we employ cost modulation or ground-control-points-based optimization in stereo matching. Experimental results demonstrate that the proposed method outperforms state-of-the-art approaches on various benchmarks, including challenging outdoor scenes.
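The pipeline of the abstract can be approximated in a few lines. The sketch below is a simplified stand-in, not the authors' code: it trains a random regression forest on synthetic per-pixel confidence features and then averages predictions within each superpixel, a crude proxy for the paper's KNN-based pixel- and superpixel-level aggregation; all names and shapes are assumptions.

```python
# Minimal sketch (synthetic data): a random regression forest maps per-pixel
# confidence features to a confidence value; spatial coherence is then
# enforced by averaging predictions within each superpixel.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n_pixels, n_feats, n_sp = 10_000, 8, 200

features = rng.normal(size=(n_pixels, n_feats))    # per-pixel confidence features
labels = rng.uniform(size=n_pixels)                # 1 = correct match, 0 = wrong
superpixels = rng.integers(0, n_sp, size=n_pixels) # superpixel id per pixel

forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(features, labels)
pixel_conf = forest.predict(features)

# Replace each pixel's confidence with the mean confidence of its superpixel.
sums = np.bincount(superpixels, weights=pixel_conf, minlength=n_sp)
counts = np.bincount(superpixels, minlength=n_sp)
superpixel_conf = sums / np.maximum(counts, 1)
smoothed_conf = superpixel_conf[superpixels]
```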
international conference on image processing | 2015
Sunok Kim; Sunghwan Choi; Kwanghoon Sohn
Estimating depth from a single monocular image is a fundamental problem in computer vision. Traditional methods for such estimation usually require complicated and sometimes labor-intensive processing. In this paper, we propose a new perspective on this problem and suggest a gradient-domain learning framework that is much simpler and more efficient. Inspired by the observation that image edges and depth discontinuities co-occur substantially in natural scenes, we learn the relationship between local appearance features and the corresponding depth gradients by applying the K-means clustering algorithm in the image feature space. We then encode each cluster centroid with its associated depth gradients, which defines visual-depth words that model the image-depth relationship well. This enables one to estimate the scene depth of an arbitrary image by simply selecting proper depth gradients from a compact dictionary of visual-depth words, followed by a Poisson surface reconstruction. Experimental results demonstrate that the proposed gradient-domain approach outperforms state-of-the-art methods both qualitatively and quantitatively and generalizes to (unseen) scene categories not used for training.
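The visual-depth-word idea maps directly onto a K-means pipeline. The sketch below is a minimal illustration on synthetic data, not the paper's implementation: it clusters appearance features, encodes each centroid with the mean depth gradient of its members, and looks up gradients for unseen features; the Poisson reconstruction step is omitted.

```python
# Minimal sketch (synthetic data): learn "visual-depth words" by K-means
# clustering of appearance features, encoding each centroid with the mean
# depth gradient of its members.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
app_feats = rng.normal(size=(5000, 16))    # local appearance features
depth_grads = rng.normal(size=(5000, 2))   # (dx, dy) depth gradients

n_words = 64
kmeans = KMeans(n_clusters=n_words, n_init=10, random_state=0).fit(app_feats)

# Encode each centroid with the mean depth gradient of its cluster members.
words = np.zeros((n_words, 2))
for k in range(n_words):
    words[k] = depth_grads[kmeans.labels_ == k].mean(axis=0)

# Inference: look up depth gradients for unseen features; a Poisson surface
# reconstruction would then integrate these gradients into a depth map.
test_feats = rng.normal(size=(10, 16))
pred_grads = words[kmeans.predict(test_feats)]
```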
international conference on consumer electronics | 2017
Kyuwon Kim; Sunok Kim; Kwanghoon Sohn
Dragging a rectangular box is arguably one of the most intuitive and popular user interfaces for selecting a whole object. However, the conventional dragging method cannot be easily applied on touch-screen devices. This paper proposes a more convenient and lazier way of bounding-box (BB) drawing for touch-screen devices. The proposed method, termed Lazy Dragging, is designed to yield accurate BB results at interactive speeds. To this end, Lazy Dragging proceeds in two stages: (i) it first filters out explicit non-borders via graph-based segmentation; (ii) it then singles out the best four box sides among the remaining candidates using an edge-feature-based random forest. In this second stage, global and local edge-based border detection and a combination of the two schemes are utilized. This paper also proposes an optional stage that simplifies further user refinement by providing invisible magnetic guidelines, which guide a user's touch to the nearest superpixel boundaries. Extensive experiments on a real-world dataset demonstrate that Lazy Dragging convincingly enhances the quality of input BBs in quasi real-time, enabling effortless object selection on small touch-screen devices.
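The two-stage idea can be sketched with off-the-shelf components. The code below is a hypothetical illustration, not the product implementation: felzenszwalb segmentation from scikit-image stands in for stage (i), and a random forest trained on a toy edge feature stands in for stage (ii); a real system would train on annotated borders with richer edge features.

```python
# Minimal sketch (toy data): graph-based segmentation proposes candidate
# border rows; a random forest scores an edge feature to pick the best side.
import numpy as np
from skimage.segmentation import felzenszwalb
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
image = rng.random((120, 160))

# Stage (i): graph-based segmentation; rows where segment labels change a lot
# are kept as border candidates, everything else is filtered out.
segments = felzenszwalb(image, scale=100)
row_edges = np.abs(np.diff(segments.astype(int), axis=0)).sum(axis=1)

# Stage (ii): a forest scores candidate rows; here it is trained on toy
# labels (a real system would use annotated borders).
X = row_edges.reshape(-1, 1)
y = (row_edges > np.median(row_edges)).astype(int)
clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

rough_top = 30                                   # user's sloppy drag
candidates = np.arange(max(rough_top - 5, 0), rough_top + 6)
best_top = candidates[np.argmax(clf.predict_proba(X[candidates])[:, 1])]
```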
IEEE Transactions on Consumer Electronics | 2017
Kyuwon Kim; Sunok Kim; Kwanghoon Sohn
Dragging rectangular boxes with a finger is among the most intuitive and popular operations for selecting a whole object. This work proposes a convenient method of bounding-box drawing for touch-screen devices, called Lazy Dragging. To achieve real-time and accurate performance, it first filters out explicit non-borders via graph-based segmentation. It then singles out the best four box sides among the remaining candidates using an edge-feature-based random forest. Experiments on a real-world dataset demonstrate that Lazy Dragging convincingly enhances the quality of bounding-box inputs, enabling easy object selection on small touch-screen devices.
Proceedings of the 2018 Workshop on Audio-Visual Scene Understanding for Immersive Multimedia - AVSU'18 | 2018
Ji Young Lee; Sunok Kim; Seungryong Kim; Kwanghoon Sohn
We present spatiotemporal-attention-based multimodal deep neural networks for dimensional emotion recognition in multimodal audio-visual video sequences. To learn the temporal attention that discriminatively focuses on emotionally salient parts of speech audio, we formulate the temporal attention network using deep neural networks (DNNs). In addition, to learn the spatiotemporal attention that selectively focuses on emotionally salient parts of facial videos, a spatiotemporal encoder-decoder network is formulated using Convolutional LSTM (ConvLSTM) modules and learned implicitly without any pixel-level annotations. By leveraging the spatiotemporal attention, a 3D convolutional neural network (3D-CNN) is also formulated to robustly recognize dimensional emotion in facial videos. Furthermore, to exploit multimodal information, we fuse the audio and video features into an emotion regression model. The experimental results show that our method achieves state-of-the-art results in dimensional emotion recognition, with the highest concordance correlation coefficient (CCC) on the AV+EC 2017 dataset.
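The audio branch's temporal attention can be sketched compactly. The module below is a minimal PyTorch illustration under assumed shapes, not the authors' network: a small DNN scores each frame, a softmax over time produces attention weights, and the weighted sum yields the attended feature fed to a toy arousal/valence regression head.

```python
# Minimal sketch (assumed shapes): DNN-style temporal attention that weights
# emotionally salient frames of an audio feature sequence before pooling.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim, hidden=64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden),
                                   nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def forward(self, x):                          # x: (batch, time, feat_dim)
        w = torch.softmax(self.score(x), dim=1)    # attention weights over time
        return (w * x).sum(dim=1)                  # attended feature: (batch, feat_dim)

audio = torch.randn(4, 100, 40)    # 4 clips, 100 frames, 40-dim features
pooled = TemporalAttention(40)(audio)
regressor = nn.Linear(40, 2)       # toy arousal/valence regression head
prediction = regressor(pooled)
```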
Proceedings of the 1st Workshop and Challenge on Comprehensive Video Understanding in the Wild - CoVieW'18 | 2018
Jungin Park; Sangryul Jeon; Seungryong Kim; Ji Young Lee; Sunok Kim; Kwanghoon Sohn
While recognizing human actions and surrounding scenes addresses different aspects of video understanding, the two tasks have strong correlations that can be used to complement each other's singular information. In this paper, we propose an approach for joint action and scene recognition that is formulated in an end-to-end learning framework based on temporal attention techniques and their fusion. By applying temporal attention modules to a generic feature network, action and scene features are extracted efficiently, and they are then composed into a single feature vector through the proposed fusion module. Our experiments on the CoVieW18 dataset show that our model is able to detect temporal attention with only weak supervision and remarkably improves multi-task action and scene classification accuracies.
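The fusion module's role can be illustrated with a short sketch. The code below is a hypothetical PyTorch illustration with assumed dimensions, not the paper's architecture: two temporally attended feature vectors are concatenated, mixed by a linear layer, and fed to separate action and scene classification heads.

```python
# Minimal sketch (assumed dimensions): fusing attended action and scene
# features into a single vector for joint multi-task classification.
import torch
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self, dim=128, n_actions=10, n_scenes=5):
        super().__init__()
        self.mix = nn.Linear(2 * dim, dim)         # compose the two features
        self.action_head = nn.Linear(dim, n_actions)
        self.scene_head = nn.Linear(dim, n_scenes)

    def forward(self, action_feat, scene_feat):
        fused = torch.relu(self.mix(torch.cat([action_feat, scene_feat], dim=-1)))
        return self.action_head(fused), self.scene_head(fused)

action_feat = torch.randn(8, 128)   # temporally attended action features
scene_feat = torch.randn(8, 128)    # temporally attended scene features
action_logits, scene_logits = Fusion()(action_feat, scene_feat)
```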
IEEE Transactions on Image Processing | 2017
Seungryong Kim; Rui Cai; Kihong Park; Sunok Kim; Kwanghoon Sohn
We present a unified framework for the classification of image sets taken under varying modality conditions. Our method is motivated by the key observation that the image feature distribution is simultaneously influenced by the semantic class and the modality category label, which limits the performance of conventional methods for this task. With this insight, we introduce modality uniqueness as a discriminative weight that separates each modality cluster from all other clusters. By leveraging modality uniqueness, our framework is formulated as unsupervised modality clustering and classifier learning based on a modality-invariant similarity kernel. Specifically, in the assignment step, each training image is first assigned to the most similar cluster according to its modality. In the update step, the modality uniqueness and the sparse dictionary are updated based on the current cluster hypothesis. These two steps are iterated until convergence. Based on the final clusters, a modality-invariant marginalized kernel is then computed, where the similarities between the reconstructed features of each modality are aggregated across all clusters. Our framework enables reliable inference of the semantic class of an image even across large photometric variations. Experimental results show that our method outperforms conventional methods on various benchmarks, such as landmark identification under severely varying weather conditions, domain-adapting image classification, and RGB and near-infrared image classification.
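The alternating assignment/update structure resembles a k-means-style loop. The sketch below is a deliberately simplified stand-in on toy data, not the paper's method: cluster centers substitute for the modality uniqueness and sparse dictionary updates, and Euclidean distance substitutes for the modality similarity.

```python
# Minimal sketch (toy data): the alternating assignment/update loop behind
# modality clustering; each image goes to its most similar modality cluster,
# then the cluster hypothesis is re-estimated.
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(300, 32))      # image features
n_modalities = 3
centers = feats[rng.choice(300, size=n_modalities, replace=False)].copy()

for _ in range(10):
    # Assignment step: nearest modality cluster for every image.
    dists = np.linalg.norm(feats[:, None] - centers[None], axis=2)
    assign = dists.argmin(axis=1)
    # Update step: re-estimate each cluster given the current hypothesis
    # (standing in for the uniqueness and dictionary updates).
    for k in range(n_modalities):
        if np.any(assign == k):
            centers[k] = feats[assign == k].mean(axis=0)
```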
Three-Dimensional Imaging, Visualization, and Display 2016 | 2016
Sunok Kim; Youngjung Kim; Kwanghoon Sohn
In the past few years, depth estimation from a single image has received increased attention due to its wide applicability in image and video understanding. Many approaches have been developed for estimating depth from a single image based on various depth cues such as shading and motion. However, they fail to estimate a plausible depth map when the input color image comes from a category not represented in the training images. To alleviate this problem, data-driven approaches have become popular, leveraging the discriminative power of large-scale RGB-D databases. These approaches assume that an appearance-depth correlation exists in natural scenes. However, this assumption becomes ambiguous when local image regions have similar appearance but different geometric placement within the scene. Recently, depth analogy (DA) has been developed, which uses the correlation between the color image and depth gradients. DA addresses the depth ambiguity problem effectively and shows reliable performance. However, no experiments have been conducted to investigate the relationship between database scale and the quality of the estimated depth map. In this paper, we extensively examine the effects of database scale and quality on the performance of the DA method. To evaluate DA, we collect a large-scale RGB-D database using Microsoft Kinect v1 and Kinect v2 in indoor environments and a ZED stereo camera in outdoor environments. Since the depth maps obtained by Kinect v2 are of higher quality than those of Kinect v1, the depth maps in the Kinect v2 database are more reliable. The experimental results show that a high-quality, large-scale training database leads to high-quality estimated depth maps in both indoor and outdoor scenes.
electronic imaging | 2015
Sunok Kim; Changjae Oh; Youngjung Kim; Kwanghoon Sohn
This paper presents a probabilistic optimization approach to enhance the resolution of a depth map. Conventionally, a high-resolution color image is used as a cue for depth super-resolution under the assumption that pixels with similar colors are likely to have similar depths. This assumption may induce texture copying from the color image into the depth map and edge-blurring artifacts at depth boundaries. To alleviate these problems, we propose an efficient depth prior that exploits a Gaussian mixture model, in which an estimated depth map is used as a feature for computing the affinity between two pixels. Furthermore, a fixed-point iteration scheme is adopted to address the non-linearity of the constraint derived from the proposed prior. The experimental results show that the proposed method outperforms state-of-the-art methods both quantitatively and qualitatively.
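The fixed-point iteration can be illustrated on a toy 1-D signal. The sketch below is an assumption-laden simplification, not the paper's solver: the non-linear affinity weights are recomputed from the current depth estimate, then a linear weighted-average update is applied with the weights frozen, and the two steps alternate.

```python
# Minimal sketch (toy 1-D signal): fixed-point iteration for a non-linear
# smoothness constraint; affinities are recomputed from the current estimate,
# then a linear update is applied with those weights held fixed.
import numpy as np

rng = np.random.default_rng(0)
low_res = np.repeat(rng.random(16), 4)   # coarse depth, naively upsampled
depth = low_res.copy()

for _ in range(50):
    # Non-linear part: recompute neighbor affinities from the current depth.
    diff = np.abs(np.diff(depth))
    w = np.exp(-diff / 0.1)
    # Linear part: weighted average of data term and neighbors,
    # with the affinities frozen for this iteration.
    left = np.r_[depth[0], depth[:-1]]
    right = np.r_[depth[1:], depth[-1]]
    wl = np.r_[1.0, w]
    wr = np.r_[w, 1.0]
    depth = (low_res + wl * left + wr * right) / (1.0 + wl + wr)
```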