Kyu-Young Whang | Researchain

Publication

Featured research published by Kyu-Young Whang.

International Conference on Management of Data | 2007

Trajectory clustering: a partition-and-group framework

Jae-Gil Lee; Jiawei Han; Kyu-Young Whang

Existing trajectory clustering algorithms group similar trajectories as a whole, thus discovering common trajectories. Our key observation is that clustering trajectories as a whole could miss common sub-trajectories. Discovering common sub-trajectories is very useful in many applications, especially if we have regions of special interest for analysis. In this paper, we propose a new partition-and-group framework for clustering trajectories, which partitions a trajectory into a set of line segments and then groups similar line segments together into a cluster. The primary advantage of this framework is to discover common sub-trajectories from a trajectory database. Based on this partition-and-group framework, we develop a trajectory clustering algorithm TRACLUS. Our algorithm consists of two phases: partitioning and grouping. For the first phase, we present a formal trajectory partitioning algorithm using the minimum description length (MDL) principle. For the second phase, we present a density-based line-segment clustering algorithm. Experimental results demonstrate that TRACLUS correctly discovers common sub-trajectories from real trajectory data.

ACM Transactions on Database Systems | 1990

A linear-time probabilistic counting algorithm for database applications

Kyu-Young Whang; Brad T. Vander-Zanden; Howard M. Taylor

We present a probabilistic algorithm for counting the number of unique values in the presence of duplicates. This algorithm has O(q) time complexity, where q is the number of values including duplicates, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space. Traditionally, accurate counts of unique values were obtained by sorting, which has O(q log q) time complexity. Our technique, called linear counting, is based on hashing. We present a comprehensive theoretical and experimental analysis of linear counting. The analysis reveals an interesting result: A load factor (number of unique values/hash table size) much larger than 1.0 (e.g., 12) can be used for accurate estimation (e.g., 1% of error). We present this technique with two important applications to database problems: namely, (1) obtaining the column cardinality (the number of unique values in a column of a relation) and (2) obtaining the join selectivity (the number of unique values in the join column resulting from an unconditional join divided by the number of unique join column values in the relation to be joined). These two parameters are important statistics that are used in relational query optimization and physical database design.
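
The hashing idea in this abstract can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: it hashes each value into a bitmap of m slots in a single pass and estimates the distinct count as -m * ln(V), where V is the fraction of slots left empty. The function name `linear_count`, the use of BLAKE2 as the hash, and the one-byte-per-slot bitmap are assumptions made for readability.

```python
import hashlib
import math


def linear_count(values, table_size):
    """Estimate the number of distinct items in `values` (hypothetical helper).

    Uses a bitmap of `table_size` slots; one byte per slot keeps the sketch simple.
    """
    bitmap = bytearray(table_size)
    for v in values:
        # Assumed hash choice: any well-mixed hash function would do here.
        digest = hashlib.blake2b(repr(v).encode("utf-8"), digest_size=8).digest()
        slot = int.from_bytes(digest, "big") % table_size
        bitmap[slot] = 1  # duplicates hit the same slot, so they are counted once

    empty = table_size - sum(bitmap)
    if empty == 0:
        raise ValueError("bitmap is full; use a larger table_size")
    # Estimate: n = -m * ln(fraction of empty slots)
    return -table_size * math.log(empty / table_size)


if __name__ == "__main__":
    data = [i % 1000 for i in range(10000)]  # 10,000 values, 1,000 distinct
    print(round(linear_count(data, 4096)))
```

The pass over the data is linear in the number of values including duplicates, which matches the O(q) behavior the abstract describes; the table size trades memory for accuracy.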

International Conference on Management of Data | 2002

General match: a subsequence matching method in time-series databases based on generalized windows

Yang-Sae Moon; Kyu-Young Whang; Wook-Shin Han

We generalize the method of constructing windows in subsequence matching. By this generalization, we can explain earlier subsequence matching methods as special cases of a common framework. Based on the generalization, we propose a new subsequence matching method, General Match. The earlier work by Faloutsos et al. (called FRM for convenience) causes a lot of false alarms due to lack of point-filtering effect. Dual Match, recently proposed as a dual approach of FRM, improves performance significantly over FRM by exploiting point-filtering effect. However, it has the problem of having a smaller allowable window size (half that of FRM) given the minimum query length. A smaller window increases false alarms due to window size effect. General Match offers advantages of both methods: it can reduce window size effect by using large windows like FRM and, at the same time, can exploit point-filtering effect like Dual Match. General Match divides data sequences into generalized sliding windows (J-sliding windows) and the query sequence into generalized disjoint windows (J-disjoint windows). We formally prove that General Match is correct, i.e., it incurs no false dismissal. We then propose a method of estimating the optimal value of the sliding factor J that minimizes the number of page accesses. Experimental results for real stock data show that, for low selectivities (10^-6 to 10^-4), General Match improves average performance by 117% over Dual Match and by 998% over FRM; for high selectivities (10^-3 to 10^-1), by 45% over Dual Match and by 64% over FRM. The proposed generalization provides an excellent theoretical basis for understanding the underlying mechanisms of subsequence matching.
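
The abstract names the two window types but does not define them. The sketch below shows one plausible reading, stated as an assumption: a J-sliding window of size w starts at every J-th position of a data sequence, and the J-disjoint windows of a query are the non-overlapping windows of size w taken from each of the J starting offsets 0 to J-1. The function names are hypothetical, and indexing and filtering are left out entirely.

```python
def j_sliding_windows(seq, w, j):
    """Windows of length w starting at every j-th position (assumed definition)."""
    return [seq[s:s + w] for s in range(0, len(seq) - w + 1, j)]


def j_disjoint_windows(seq, w, j):
    """For each offset 0..j-1, the non-overlapping windows of length w
    that start at that offset (assumed definition)."""
    return {
        off: [seq[s:s + w] for s in range(off, len(seq) - w + 1, w)]
        for off in range(j)
    }


if __name__ == "__main__":
    data = list(range(10))
    print(j_sliding_windows(data, w=4, j=2))
    print(j_disjoint_windows(data, w=4, j=2))
```

Under this reading, j = 1 gives ordinary sliding windows over the data and ordinary disjoint windows over the query, while j = w gives disjoint windows over the data and sliding windows over the query, which is consistent with the claim that earlier methods are special cases of one framework.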

ACM Transactions on Database Systems | 1990

Query optimization in a memory-resident domain relational calculus database system

Kyu-Young Whang; Ravi Krishnamurthy

We present techniques for optimizing queries in memory-resident database systems. Optimization techniques in memory-resident database systems differ significantly from those in conventional disk-resident database systems. In this paper we address the following aspects of query optimization in such systems and present specific solutions for them: (1) a new approach to developing a CPU-intensive cost model; (2) new optimization strategies for main-memory query processing; (3) new insight into join algorithms and access structures that take advantage of memory residency of data; and (4) the effect of the operating system's scheduling algorithm on the memory-residency assumption. We present an interesting result that a major cost of processing queries in memory-resident database systems is incurred by evaluation of predicates. We discuss optimization techniques using the Office-by-Example (OBE) that has been under development at IBM Research. We also present the results of performance measurements, which prove to be excellent in the current state of the art. Despite recent work on memory-resident database systems, query optimization aspects in these systems have not been well studied. We believe this paper opens the issues of query optimization in memory-resident database systems and presents practical solutions to them.

Very Large Data Bases | 1994

Dynamic maintenance of data distribution for selectivity estimation

Kyu-Young Whang; Sang-Wook Kim; Gio Wiederhold

We propose a new dynamic method for multidimensional selectivity estimation for range queries that works accurately independent of data distribution. Good estimation of selectivity is important for query optimization and physical database design. Our method employs the multilevel grid file (MLGF) for accurate estimation of multidimensional data distribution. The MLGF is a dynamic, hierarchical, balanced, multidimensional file structure that gracefully adapts to nonuniform and correlated distributions. We show that the MLGF directory naturally represents a multidimensional data distribution. We then extend it for further refinement and present the selectivity estimation method based on the MLGF. Extensive experiments have been performed to test the accuracy of selectivity estimation. The results show that estimation errors are very small independent of distributions, even with correlated and/or highly skewed ones. Finally, we analyze the cause of errors in estimation and investigate the effects of various parameters on the accuracy of estimation.
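
To make the idea of estimating range-query selectivity from a stored data distribution concrete, here is a minimal sketch. It is not the MLGF-based method: it assumes a flat list of rectangular cells with record counts as a stand-in for the directory, and it assumes records are spread uniformly inside each cell. The function name `estimate_selectivity` and the cell layout are hypothetical.

```python
def estimate_selectivity(cells, query, total):
    """Estimate the fraction of records that fall inside a range query.

    cells: list of (box, count); box is a list of (lo, hi) pairs, one per dimension.
    query: list of (lo, hi) pairs, one per dimension.
    total: total number of records.
    Assumes uniform distribution within each cell (an illustration-only assumption).
    """
    expected = 0.0
    for box, count in cells:
        fraction = 1.0
        for (c_lo, c_hi), (q_lo, q_hi) in zip(box, query):
            overlap = min(c_hi, q_hi) - max(c_lo, q_lo)
            if overlap <= 0:
                fraction = 0.0  # the query misses this cell in this dimension
                break
            fraction *= overlap / (c_hi - c_lo)
        expected += count * fraction
    return expected / total


if __name__ == "__main__":
    cells = [([(0, 10), (0, 10)], 80), ([(10, 20), (0, 10)], 20)]
    print(estimate_selectivity(cells, [(5, 15), (0, 10)], total=100))
```

The finer the cells are where the data is dense, the weaker the uniformity assumption becomes, which is one way to see why an adaptive structure helps with skewed or correlated data.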

IEEE Transactions on Computers | 1984

Separability: An Approach to Physical Database Design

Kyu-Young Whang; Wiederhold; Sagalowicz

A theoretical approach to the optimal design of a large multifile physical database is presented. The design algorithm is based on the theory that, given a set of join methods that satisfy a certain property called separability, the problem of optimal assignment of access structures to the whole database can be reduced to the subproblem of optimizing individual relations independently of one another. Coupling factors are defined to represent all the interactions among the relations. This approach not only reduces the complexity of the problem significantly, but also provides a better understanding of underlying mechanisms.

International Conference on Management of Data | 2006

Continuous query processing in data streams using duality of data and queries

Hyo-Sang Lim; Jae-Gil Lee; Min-Jae Lee; Kyu-Young Whang; Il-Yeol Song

Recent data stream systems such as TelegraphCQ have employed the well-known property of duality between data and queries. In these systems, query processing methods are classified into two dual categories (data-initiative and query-initiative) depending on whether query processing is initiated by selecting a data element or a query. Although the duality property has been widely recognized, previous data stream systems do not take full advantage of this property since they use the two dual methods independently: data-initiative methods only for continuous queries and query-initiative methods only for ad-hoc queries. We contend that continuous query processing can be better optimized by adopting an approach that integrates the two dual methods. Our primary contribution is based on the observation that spatial join is a powerful tool for achieving this objective. In this paper, we first present a new viewpoint of transforming the continuous query processing problem to a multi-dimensional spatial join problem. We then present a continuous query processing algorithm based on spatial join, which we name Spatial Join CQ. This algorithm processes continuous queries by finding the pairs of overlapping regions from a set of data elements and a set of queries, both defined as regions in the multi-dimensional space. The algorithm achieves the advantages of the two dual methods simultaneously. Experimental results show that the proposed algorithm outperforms earlier algorithms by up to 36 times for simple selection continuous queries and by up to 7 times for sliding window join queries.
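
The core operation described here, finding overlapping pairs between a set of data regions and a set of query regions, can be illustrated with a deliberately naive sketch. This is a nested-loop join over axis-aligned boxes, not the Spatial Join CQ algorithm itself; the names `overlaps` and `spatial_join` and the box representation are assumptions. A point data element is modeled as a degenerate box whose lower and upper bounds coincide.

```python
def overlaps(box_a, box_b):
    """True if two axis-aligned boxes intersect in every dimension.

    Each box is a list of (lo, hi) pairs, one per dimension (closed intervals).
    """
    return all(
        a_lo <= b_hi and b_lo <= a_hi
        for (a_lo, a_hi), (b_lo, b_hi) in zip(box_a, box_b)
    )


def spatial_join(data_regions, query_regions):
    """Return (data_id, query_id) pairs whose regions overlap (naive nested loop)."""
    return [
        (d_id, q_id)
        for d_id, d_box in data_regions.items()
        for q_id, q_box in query_regions.items()
        if overlaps(d_box, q_box)
    ]


if __name__ == "__main__":
    # Two-dimensional example: data elements are points, queries are range predicates.
    data = {"d1": [(5, 5), (20, 20)], "d2": [(50, 50), (1, 1)]}
    queries = {"q1": [(0, 10), (10, 30)], "q2": [(40, 60), (0, 5)]}
    print(spatial_join(data, queries))
```

Treating both sides as sets of regions is what lets a batch of data elements be matched against a batch of queries in one pass, rather than driving the process from one side only.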

Data Mining and Knowledge Discovery | 2004

A Subsequence Matching Algorithm that Supports Normalization Transform in Time-Series Databases

Woong-Kee Loh; Sang-Wook Kim; Kyu-Young Whang

In this paper, an algorithm is proposed for subsequence matching that supports normalization transform in time-series databases. Normalization transform enables finding sequences with similar fluctuation patterns even though they are not close to each other before the normalization transform. Simple application of existing subsequence matching algorithms to support normalization transform is not feasible since the algorithms do not have information for normalization transform of subsequences of arbitrary lengths. Application of the existing whole matching algorithm supporting normalization transform to the subsequence matching is feasible, but requires an index for every possible length of the query sequence causing serious overhead on both storage space and update time. The proposed algorithm generates indexes only for a small number of different lengths of query sequences. For subsequence matching it selects the most appropriate index among them. Better search performance can be obtained by using more indexes. In this paper, the approach is called index interpolation. It is formally proved that the proposed algorithm does not cause false dismissal. The search performance can be traded off with storage space by adjusting the number of indexes. For performance evaluation, a series of experiments is conducted using the indexes for only five different lengths out of lengths 256 to 512 of the query sequence. The results show that the proposed algorithm outperforms the sequential scan by up to 2.4 times on the average when the selectivity of the query is 10^-2 and up to 14.6 times when it is 10^-5. Since the proposed algorithm performs better with smaller selectivities, it is suitable for practical situations, where the queries with smaller selectivities are much more frequent.
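
A small example shows why normalization lets sequences with similar fluctuation patterns match even when their raw values are far apart. The sketch assumes the normalization transform is the usual mean-and-standard-deviation (z-score) normalization, which the abstract does not spell out; the function names are hypothetical, and the index-selection step of the proposed algorithm is not modeled.

```python
import math


def normalize(seq):
    """Shift a sequence to mean 0 and scale it to standard deviation 1
    (assumed form of the normalization transform)."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        raise ValueError("constant sequence cannot be normalized")
    return [(x - mean) / std for x in seq]


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    a = [1, 2, 3, 2, 1]
    b = [10 * x + 100 for x in a]  # same shape, different offset and scale
    print(euclidean(a, b))                        # large before normalization
    print(euclidean(normalize(a), normalize(b)))  # near zero after normalization
```

Because the mean and standard deviation depend on the exact subsequence being normalized, they change with every start position and length, which is the difficulty the abstract points to for subsequences of arbitrary lengths.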

ACM Transactions on Information Systems | 1987

Office-by-example: an integrated office system and database manager

Kyu-Young Whang; Arthur C. Ammann; Anthony Bolmarcich; Maria Hanrahan; Guy Hochgesang; Kuan-Tsae Huang; Al Khorasani; Ravi Krishnamurthy; Gary H. Sockut; Paula Sweeney; Vance E. Waddle; Moshé M. Zloof

Office-by-Example (OBE) is an integrated office information system that has been under development at IBM Research. OBE, an extension of Query-by-Example, supports various office features such as database tables, word processing, electronic mail, graphics, images, and so forth. These seemingly heterogeneous features are integrated through a language feature called example elements. Applications involving example elements are processed by the database manager, an integrated part of the OBE system. In this paper we describe the facilities and architecture of the OBE system and discuss the techniques for integrating heterogeneous objects.

Information Systems | 1987

Approximating the number of unique values of an attribute without sorting

Morton M. Astrahan; Mario Schkolnick; Kyu-Young Whang

Counts of unique values are frequently needed in database systems. In particular, they are essential in query optimization and physical database design. Traditionally, exact counts were obtained by sorting, which is an expensive operation. In this paper we present three algorithms for counting unique values by probabilistic methods. These algorithms require only one pass over the data, and produce approximations to the true count with certain standard deviations. For deviations acceptable in practical environments (~10%), the algorithms require only modest amounts of memory space and computation time. We have implemented all three algorithms in System R. We also present the results of the experiments on accuracy and performance of these algorithms.