YongChul Kwon
University of Washington
Publications
Featured research published by YongChul Kwon.
International Conference on Management of Data | 2012
YongChul Kwon; Magdalena Balazinska; Bill Howe; Jerome Rolia
We present an automatic skew mitigation approach for user-defined MapReduce programs and introduce SkewTune, a system that implements this approach as a drop-in replacement for an existing MapReduce implementation. Such a system must meet three key challenges: (a) require no extra input from the user yet work for all MapReduce applications, (b) be completely transparent, and (c) impose minimal overhead if there is no skew. The SkewTune approach addresses these challenges and works as follows: when a node in the cluster becomes idle, SkewTune identifies the task with the greatest expected remaining processing time. The unprocessed input data of this straggling task is then proactively repartitioned in a way that fully utilizes the nodes in the cluster and preserves the ordering of the input data, so that the original output can be reconstructed by concatenation. We implement SkewTune as an extension to Hadoop and evaluate its effectiveness using several real applications. The results show that SkewTune can significantly reduce job runtime in the presence of skew and adds little to no overhead in the absence of skew.
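To make the repartitioning step concrete, here is a minimal Python sketch of the decision SkewTune makes when a node goes idle; the Task abstraction and the example numbers are illustrative, not part of Hadoop's or SkewTune's actual API.

from dataclasses import dataclass

@dataclass
class Task:
    name: str
    unprocessed: list          # remaining input records, in original order
    expected_remaining: float  # estimated seconds of work left

def plan_mitigation(running_tasks, idle_nodes):
    # Pick the straggler: the task with the greatest expected remaining time.
    straggler = max(running_tasks, key=lambda t: t.expected_remaining)
    # Carve its unprocessed input into contiguous ranges, one per node
    # (the idle nodes plus the straggler's own), preserving input order so
    # the original output can be rebuilt by concatenating the new outputs.
    parts = len(idle_nodes) + 1
    n = len(straggler.unprocessed)
    step = max(1, -(-n // parts))  # ceiling division
    return [straggler.unprocessed[i:i + step] for i in range(0, n, step)]

# One skewed reduce task and two idle nodes yield three ordered sub-ranges.
t = Task("reduce-7", list(range(9)), expected_remaining=120.0)
print(plan_mitigation([t], ["node-2", "node-3"]))  # [[0,1,2], [3,4,5], [6,7,8]]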
Symposium on Cloud Computing | 2010
YongChul Kwon; Magdalena Balazinska; Bill Howe; Jerome Rolia
Scientists today have the ability to generate data at an unprecedented scale and rate and, as a result, they must increasingly turn to parallel data processing engines to perform their analyses. However, the simple execution model of these engines can make it difficult to implement efficient algorithms for scientific analytics. In particular, many scientific analytics require the extraction of features from data represented as either a multidimensional array or points in a multidimensional space. These applications exhibit significant computational skew, where the runtime of different partitions depends on more than just input size and can therefore vary dramatically and unpredictably. In this paper, we present SkewReduce, a new system implemented on top of Hadoop that enables users to easily express feature extraction analyses and execute them efficiently. At the heart of the SkewReduce system is an optimizer, parameterized by user-defined cost functions, that determines how best to partition the input data to minimize computational skew. Experiments on real data from two different science domains demonstrate that our approach can improve execution times by a factor of up to 8 compared to a naive implementation.
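A minimal sketch of the optimizer's core idea, assuming a simple recursive bisection in place of the paper's actual partitioning algorithm; the cost function and data below are invented for illustration.

def partition(data, cost_fn, budget):
    # Split recursively until the user-supplied cost estimate of each
    # partition falls under the budget (e.g., estimated seconds of work).
    if cost_fn(data) <= budget or len(data) <= 1:
        return [data]
    mid = len(data) // 2  # bisect along the sorted coordinate
    return (partition(data[:mid], cost_fn, budget) +
            partition(data[mid:], cost_fn, budget))

# A superlinear cost function stands in for a feature-extraction step whose
# runtime grows faster than input size; partitions shrink until each one
# fits the budget, rather than being split evenly by record count.
points = sorted([0.1, 0.2, 0.21, 0.22, 0.23, 5.0, 9.0])
print(partition(points, cost_fn=lambda p: len(p) ** 2, budget=9))
# [[0.1, 0.2, 0.21], [0.22, 0.23], [5.0, 9.0]]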
Very Large Data Bases | 2010
Nodira Khoussainova; YongChul Kwon; Magdalena Balazinska; Dan Suciu
In this paper, we present SnipSuggest, a system that provides on-the-go, context-aware assistance in the SQL composition process. SnipSuggest aims to help the increasing population of non-expert database users, who need to perform complex analysis on their large-scale datasets, but have difficulty writing SQL queries. As a user types a query, SnipSuggest recommends possible additions to various clauses in the query using relevant snippets collected from a log of past queries. SnipSuggest's current capabilities include suggesting tables, views, and table-valued functions in the FROM clause, columns in the SELECT clause, predicates in the WHERE clause, columns in the GROUP BY clause, aggregates, and some support for sub-queries. SnipSuggest adjusts its recommendations according to the context: as the user writes more of the query, it is able to provide more accurate suggestions. We evaluate SnipSuggest over two query logs: one from an undergraduate database class and another from the Sloan Digital Sky Survey database. We show that SnipSuggest is able to recommend useful snippets with up to 93.7% average precision, at interactive speed. We also show that SnipSuggest outperforms naive approaches, such as recommending popular snippets.
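The recommendation step can be sketched as ranking candidate tables by their conditional popularity in the query log given the partial query; the log and table names below are hypothetical, and the real system uses richer features than table co-occurrence.

from collections import Counter

# Hypothetical query log: each past query reduced to the set of tables
# in its FROM clause.
log = [{"photoobj", "star"}, {"photoobj", "star", "galaxy"}, {"photoobj", "run"}]

def suggest_tables(partial, log, k=3):
    # Rank candidate tables by P(table | tables already in the query),
    # estimated from co-occurrence counts over the log.
    matching = [q for q in log if partial <= q]
    counts = Counter(t for q in matching for t in q - partial)
    total = len(matching) or 1
    return [(t, round(c / total, 2)) for t, c in counts.most_common(k)]

print(suggest_tables({"photoobj"}, log))
# [('star', 0.67), ('galaxy', 0.33), ('run', 0.33)]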
Publications of the Astronomical Society of the Pacific | 2011
Keith Wiley; Andrew J. Connolly; Jeffrey P. Gardner; K. Simon Krughoff; Magdalena Balazinska; Bill Howe; YongChul Kwon; Yingyi Bu
In the coming decade, astronomical surveys of the sky will generate tens of terabytes of images and detect hundreds of millions of sources every night. The study of these sources will involve computational challenges such as anomaly detection and classification, and moving object tracking. Since such studies benefit from the highest quality data, methods such as image coaddition (stacking) will be a critical preprocessing step prior to scientific investigation. With a requirement that these images be analyzed on a nightly basis to identify moving sources or transient objects, these data streams present many computational challenges. Given the quantity of data involved, the computational load of these problems can only be addressed by distributing the workload over a large number of nodes. However, the high data throughput demanded by these applications may present scalability challenges for certain storage architectures. One scalable data-processing method that has emerged in recent years is MapReduce, and in this paper we focus on its popular open-source implementation called Hadoop. In the Hadoop framework, the data is partitioned among storage attached directly to worker nodes, and the processing workload is scheduled in parallel on the nodes that contain the required input data. A further motivation for using Hadoop is that it allows us to exploit cloud computing resources, e.g., Amazon's EC2. We report on our experience implementing a scalable image-processing pipeline for the SDSS imaging database using Hadoop. This multi-terabyte imaging dataset provides a good testbed for algorithm development since its scope and structure approximate future surveys. First, we describe MapReduce and how we adapted image coaddition to the MapReduce framework. Then we describe a number of optimizations to our basic approach and report experimental results comparing their performance.
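Cast as MapReduce, coaddition maps each input pixel to its sky coordinate and reduces by averaging the contributions. A toy Python sketch, with plain functions standing in for Hadoop's map and reduce tasks and dicts standing in for images; the real pipeline also handles projection, interpolation, and per-image weighting.

from collections import defaultdict

def map_phase(images, bounds):
    # Emit (sky_coordinate, flux) for every pixel that falls in the bounds.
    for image in images:
        for coord, flux in image.items():
            if coord in bounds:
                yield coord, flux

def reduce_phase(pairs):
    # Average all flux contributions that land on each sky coordinate.
    acc = defaultdict(list)
    for coord, flux in pairs:
        acc[coord].append(flux)
    return {coord: sum(v) / len(v) for coord, v in acc.items()}

# Two overlapping toy "images": stacking averages the shared pixel.
images = [{(0, 0): 1.0, (0, 1): 3.0}, {(0, 0): 2.0}]
print(reduce_phase(map_phase(images, bounds={(0, 0), (0, 1)})))
# {(0, 0): 1.5, (0, 1): 3.0}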
Very Large Data Bases | 2008
YongChul Kwon; Magdalena Balazinska; Albert G. Greenberg
We present SGuard, a new fault-tolerance technique for distributed stream processing engines (SPEs) running in clusters of commodity servers. SGuard is less disruptive to normal stream processing and leaves more resources available for normal stream processing than previous proposals. Like several previous schemes, SGuard is based on rollback recovery [18]: it checkpoints the state of stream processing nodes periodically and restarts failed nodes from their most recent checkpoints. In contrast to previous proposals, however, SGuard performs checkpoints asynchronously: i.e., operators continue processing streams during the checkpoint, thus reducing the potential disruption due to the checkpointing activity. Additionally, SGuard saves the checkpointed state into a new type of distributed and replicated file system (DFS) such as GFS [22] or HDFS [9], leaving more memory resources available for normal stream processing. To manage resource contention due to simultaneous checkpoints by different SPE nodes, SGuard adds a scheduler to the DFS. This scheduler coordinates large batches of write requests in a manner that reduces individual checkpoint times while maintaining good overall resource utilization. We demonstrate the effectiveness of the approach through measurements of a prototype implementation in the Borealis [2] open-source SPE using HDFS [9] as the DFS.
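The asynchronous-checkpointing idea can be illustrated in a few lines: copy the operator state under a brief lock, then serialize and write it on a background thread while stream processing continues. This is a deliberate simplification, not SGuard's actual memory-management mechanism or its DFS scheduler.

import json
import threading

class Operator:
    # Toy stream operator whose state is a dictionary of aggregates.
    def __init__(self):
        self.state = {}
        self.lock = threading.Lock()

    def process(self, key, value):
        with self.lock:
            self.state[key] = self.state.get(key, 0) + value

    def checkpoint(self, path):
        with self.lock:          # brief pause: copy the state, don't write it
            snapshot = dict(self.state)
        def write():             # slow I/O runs off the processing path
            with open(path, "w") as f:
                json.dump(snapshot, f)
        writer = threading.Thread(target=write)
        writer.start()
        return writer

op = Operator()
op.process("sensor-1", 5)
op.checkpoint("/tmp/op.ckpt").join()  # join here only so the demo exits cleanly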
International Conference on Cluster Computing | 2009
Sarah Loebman; Dylan Nunley; YongChul Kwon; Bill Howe; Magdalena Balazinska; Jeffrey P. Gardner
As the datasets used to fuel modern scientific discovery grow increasingly large, they become increasingly difficult to manage using conventional software. Parallel database management systems (DBMSs) and massive-scale data processing systems such as MapReduce hold promise to address this challenge. However, since these systems have not been expressly designed for scientific applications, their efficacy in this domain has not been thoroughly tested. In this paper, we study the performance of these engines in one specific domain: massive astrophysical simulations. We develop a use case that comprises five representative queries. We implement this use case in one distributed DBMS and in the Pig/Hadoop system. We compare the performance of the tools to each other and to hand-written IDL scripts. We find that certain representative analyses are easy to express in each engine's high-level language, and that both systems provide competitive performance and improved scalability relative to current IDL-based methods.
International Conference on Management of Data | 2011
Prasang Upadhyaya; YongChul Kwon; Magdalena Balazinska
We address the problem of making online, parallel query plans fault-tolerant: i.e., provide intra-query fault-tolerance without blocking. We develop an approach that not only achieves this goal but does so through the use of different fault-tolerance techniques at different operators within a query plan. Enabling each operator to use a different fault-tolerance strategy leads to a space of fault-tolerance plans amenable to cost-based optimization. We develop FTOpt, a cost-based fault-tolerance optimizer that automatically selects the best strategy for each operator in a query plan in a manner that minimizes the expected processing time with failures for the entire query. We implement our approach in a prototype parallel query-processing engine. Our experiments demonstrate that (1) there is no single best fault-tolerance strategy for all query plans, (2) often hybrid strategies that mix-and-match recovery techniques outperform any uniform strategy, and (3) our optimizer correctly identifies winning fault-tolerance configurations.
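A minimal sketch of the optimizer's search, assuming invented per-operator cost estimates and independent (additive) costs; FTOpt's actual cost model accounts for interactions between operators' strategies, which is what makes the optimization non-trivial.

from itertools import product

# Hypothetical expected processing times (seconds) with failures, per
# operator and per fault-tolerance strategy; the numbers are made up.
expected_cost = {
    "scan": {"none": 10, "checkpoint": 14, "upstream-replay": 11},
    "join": {"none": 40, "checkpoint": 25, "upstream-replay": 33},
    "agg":  {"none": 12, "checkpoint": 18, "upstream-replay": 13},
}

def pick_plan(expected_cost):
    # Exhaustively search the space of per-operator strategy assignments
    # and keep the plan with the lowest total expected time with failures.
    ops = list(expected_cost)
    best = min(product(*(expected_cost[op] for op in ops)),
               key=lambda plan: sum(expected_cost[op][s]
                                    for op, s in zip(ops, plan)))
    return dict(zip(ops, best))

print(pick_plan(expected_cost))
# {'scan': 'none', 'join': 'checkpoint', 'agg': 'none'} -- a hybrid plan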
Statistical and Scientific Database Management | 2011
Nodira Khoussainova; YongChul Kwon; Wei Ting Liao; Magdalena Balazinska; Wolfgang Gatterbauer; Dan Suciu
Scientists today are able to generate and collect data at an unprecedented scale [1], and they must then analyze and explore these datasets. Composing SQL queries, however, is a significant challenge for scientists because most are not database experts.
Very Large Data Bases | 2012
YongChul Kwon; Magdalena Balazinska; Bill Howe; Jerome Rolia
We demonstrate SkewTune, a system that automatically mitigates skew in user-defined MapReduce programs and is a drop-in replacement for Hadoop. The demonstration has two parts. First, we demonstrate how SkewTune mitigates skew in real MapReduce applications at runtime by running a real application in a public cloud. Second, through an interactive graphical interface, we demonstrate the details of the skew mitigation process using both real and synthetic workloads that represent various skew configurations.
International Conference on Data Mining | 2008
YongChul Kwon; Wing Yee Lee; Magdalena Balazinska; Guiping Xu
Monitoring applications play an increasingly important role in many domains. They detect events in monitored systems and take actions such as invoking a program or notifying an administrator. Often administrators must then manually investigate events to figure out the source of a problem. Stream processing engines (SPEs) are general purpose data management systems for monitoring applications. They provide low-latency stream processing but have limited or no support for manual event investigation. In this paper, we propose a new technique for an SPE to support event investigation by automatically classifying events on streams. Unlike previous stream clustering algorithms, our approach takes into account complex user-defined contexts for events. Our approach comprises three key components: an event context data model, a distance measure for event contexts, and an online clustering algorithm for event contexts. We evaluate our approach using synthetic data and show that complex context information can improve online event classification.
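The online classification step can be sketched as nearest-representative clustering under a distance over contexts; the Jaccard distance and flat feature-set contexts below are a simplification of the paper's richer context model.

def context_distance(a, b):
    # Jaccard distance between two event contexts given as feature sets.
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0

def classify(event_context, clusters, threshold=0.5):
    # Assign the event to the nearest existing cluster representative,
    # or open a new cluster if nothing is within the threshold.
    if clusters:
        best = min(clusters, key=lambda rep: context_distance(event_context, rep))
        if context_distance(event_context, best) <= threshold:
            return best
    clusters.append(event_context)
    return event_context

clusters = []
for ctx in [{"db", "timeout"}, {"db", "timeout", "retry"}, {"disk", "full"}]:
    print(classify(ctx, clusters))
# The first two events share a cluster; the third opens a new one.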