
Publication


Featured research published by Wook-Shin Han.


International Conference on Management of Data | 2002

General match: a subsequence matching method in time-series databases based on generalized windows

Yang-Sae Moon; Kyu-Young Whang; Wook-Shin Han

We generalize the method of constructing windows in subsequence matching. With this generalization, we can explain earlier subsequence matching methods as special cases of a common framework. Based on the generalization, we propose a new subsequence matching method, General Match. The earlier work by Faloutsos et al. (called FRM for convenience) causes a lot of false alarms due to the lack of a point-filtering effect. Dual Match, recently proposed as a dual approach of FRM, improves performance significantly over FRM by exploiting the point-filtering effect. However, it has the problem of a smaller allowable window size (half that of FRM) given the minimum query length. A smaller window increases false alarms due to the window size effect. General Match offers the advantages of both methods: it can reduce the window size effect by using large windows like FRM and, at the same time, can exploit the point-filtering effect like Dual Match. General Match divides data sequences into generalized sliding windows (J-sliding windows) and the query sequence into generalized disjoint windows (J-disjoint windows). We formally prove that General Match is correct, i.e., it incurs no false dismissal. We then propose a method of estimating the optimal value of the sliding factor J that minimizes the number of page accesses. Experimental results for real stock data show that, for low selectivities (10^-6 ∼ 10^-4), General Match improves average performance by 117% over Dual Match and by 998% over FRM; for high selectivities (10^-3 ∼ 10^-1), by 45% over Dual Match and by 64% over FRM. The proposed generalization provides an excellent theoretical basis for understanding the underlying mechanisms of subsequence matching.
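
The window construction can be sketched concretely. Below is a minimal Python sketch under our own assumptions (the function name and data model are illustrative, not the authors' code): with sliding factor J, windows of length w start at every J-th position of the data sequence, so J = 1 reproduces FRM-style sliding windows and J = w reproduces Dual Match-style disjoint windows.

    # Generalized window construction (illustrative sketch, not the paper's code).
    def generalized_windows(sequence, w, J):
        """Length-w windows starting at positions 0, J, 2J, ..."""
        return [sequence[i:i + w] for i in range(0, len(sequence) - w + 1, J)]

    data = list(range(10))
    frm_style = generalized_windows(data, w=4, J=1)   # sliding windows (FRM)
    dual_style = generalized_windows(data, w=4, J=4)  # disjoint windows (Dual Match)

Intermediate values of J trade the window size effect against the point-filtering effect; the paper's estimation method picks the J that minimizes page accesses.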


International Conference on Management of Data | 2013

TurboISO: towards ultrafast and robust subgraph isomorphism search in large graph databases

Wook-Shin Han; Jinsoo Lee; Jeong-Hoon Lee

Given a query graph q and a data graph g, subgraph isomorphism search finds all occurrences of q in g and is considered one of the most fundamental query types for many real applications. Although this problem is NP-hard, many algorithms have been proposed to solve it in reasonable time for real datasets. However, a recent study has shown, through an extensive benchmark with various real datasets, that all existing algorithms have serious problems in their matching order selection. Furthermore, all algorithms blindly permute all possible mappings for query vertices, often leading to useless computations. In this paper, we present an efficient and robust subgraph search solution, called TurboISO, which is turbo-charged with two novel concepts: candidate region exploration and the combine-and-permute strategy (Comb/Perm for short). Candidate region exploration identifies, on the fly, candidate subgraphs (i.e., candidate regions) that contain embeddings, and computes a robust matching order for each candidate region explored. The Comb/Perm strategy exploits the novel concept of the neighborhood equivalence class (NEC): query vertices in the same NEC have identical sets of matching data vertices. During subgraph isomorphism search, Comb/Perm generates only combinations for each NEC instead of permuting all possible enumerations. Thus, if a chosen combination is determined not to contribute to a complete solution, all possible permutations of that combination are safely pruned. Extensive experiments with many real datasets show that TurboISO consistently and significantly outperforms all competitors, by up to several orders of magnitude.
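
A minimal sketch of the NEC idea, under assumed representations (dict-based adjacency; not the TurboISO code): query vertices with the same label and the same neighborhood fall into one class, so candidates need only be combined, not permuted, within each class.

    from collections import defaultdict

    def neighborhood_equivalence_classes(labels, adj):
        """Group query vertices by (label, neighborhood) signature.
        labels: vertex -> label; adj: vertex -> set of neighbors."""
        classes = defaultdict(list)
        for v in labels:
            classes[(labels[v], frozenset(adj[v]))].append(v)
        return list(classes.values())

    # u1 and u2 are 'B'-labeled leaves of u0, so they form one NEC and
    # share the same candidate data vertices.
    labels = {'u0': 'A', 'u1': 'B', 'u2': 'B'}
    adj = {'u0': {'u1', 'u2'}, 'u1': {'u0'}, 'u2': {'u0'}}
    print(neighborhood_equivalence_classes(labels, adj))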


Very Large Data Bases | 2008

Parallelizing query optimization

Wook-Shin Han; Wooseong Kwak; Jinsoo Lee; Guy M. Lohman; Volker Markl

Many commercial RDBMSs employ cost-based query optimization exploiting dynamic programming (DP) to efficiently generate the optimal query execution plan. However, optimization time increases rapidly for queries joining more than 10 tables. Randomized or heuristic search algorithms reduce query optimization time for large join queries by considering fewer plans, sacrificing plan optimality. Though commercial systems that execute query plans in parallel have existed for over a decade, the optimization of such plans still occurs serially. While modern microprocessors employ multiple cores to accelerate computations, parallelizing query optimization to exploit multi-core parallelism is not as straightforward as it may seem. The DP used in join enumeration belongs to the challenging nonserial polyadic DP class because of its non-uniform data dependencies. In this paper, we propose a comprehensive and practical solution for parallelizing query optimization on multi-core processor architectures, including a parallel join enumeration algorithm and several alternative ways to allocate work to threads to balance their load. We also introduce a novel data structure called the skip vector array to significantly reduce the generation of infeasible join partitions. This solution has been prototyped in PostgreSQL. Extensive experiments using various query graph topologies confirm that our algorithms allocate the work evenly, thereby achieving almost linear speed-up. Our parallel join enumeration algorithm enhanced with our skip vector array outperforms the conventional generate-and-filter DP algorithm by up to two orders of magnitude for star queries: linear speedup due to parallelism and an order-of-magnitude performance improvement due to the skip vector array.
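
The loop structure being parallelized can be seen in a serial, size-driven DP join enumerator. The sketch below is illustrative (a DPsize-style skeleton, not the paper's implementation; `connected` and `cost` are assumed callbacks): the innermost pair test is the generate-and-filter step that the skip vector array is designed to avoid, and the (left, right) pairs are the units of work that can be distributed across threads.

    def dp_join_enumeration(tables, connected, cost):
        """best[S] holds the cheapest plan found for table set S."""
        best = {frozenset([t]): ('scan', t) for t in tables}
        for size in range(2, len(tables) + 1):
            for lsize in range(1, size // 2 + 1):
                lefts = [s for s in best if len(s) == lsize]
                rights = [s for s in best if len(s) == size - lsize]
                for left in lefts:
                    for right in rights:
                        # generate-and-filter: most pairs fail these checks,
                        # which is what the skip vector array avoids generating
                        if left & right or not connected(left, right):
                            continue
                        s, plan = left | right, ('join', best[left], best[right])
                        if s not in best or cost(plan) < cost(best[s]):
                            best[s] = plan
        return best.get(frozenset(tables))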


International World Wide Web Conferences | 2007

Mapping-driven XML transformation

Haifeng Jiang; Howard Ho; Lucian Popa; Wook-Shin Han

Clio is an existing schema-mapping tool that provides user-friendly means to manage and facilitate the complex task of transforming and integrating heterogeneous data such as XML over the Web or in XML databases. By means of mappings from source to target schemas, Clio helps users conveniently establish the precise semantics of data transformation and integration. In this paper we study the problem of how to efficiently implement such data transformations (i.e., generating target data from source data based on schema mappings). We present a three-phase framework for high-performance XML-to-XML transformation based on schema mappings, and discuss methodologies and algorithms for implementing these phases. In particular, we elaborate on novel techniques such as streamed extraction of mapped source values and scalable disk-based merging of overlapping data (including duplicate elimination). We compare our transformation framework with alternative methods such as using XQuery or SQL/XML provided by current commercial databases. The results demonstrate that the three-phase framework, simple as it is, is highly scalable and outperforms the alternative methods by orders of magnitude.
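
The three-phase shape can be pictured with a toy, runnable sketch (plain dicts stand in for XML, and all names here are our own assumptions, not Clio's code):

    def extract(source_rows, mappings):
        """Phase 1: streamed extraction of mapped source values."""
        return [m(row) for row in source_rows for m in mappings]

    def merge(tuples):
        """Phase 2: merge overlapping fragments, eliminating duplicates."""
        merged = {}
        for key, value in tuples:
            merged.setdefault(key, set()).add(value)
        return merged

    def generate(merged):
        """Phase 3: produce the nested target structure."""
        return [{'group': k, 'members': sorted(v)} for k, v in merged.items()]

    rows = [{'dept': 'db', 'name': 'ann'}, {'dept': 'db', 'name': 'ann'}]
    mappings = [lambda r: (r['dept'], r['name'])]
    print(generate(merge(extract(rows, mappings))))  # duplicate 'ann' removed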


International Conference on Data Engineering | 2005

Odysseus: a high-performance ORDBMS tightly-coupled with IR features

Kyu-Young Whang; Min-Jae Lee; Jae-Gil Lee; Min-Soo Kim; Wook-Shin Han

We propose the notion of tight-coupling [Whang et al., 1999] to add new data types into the DBMS engine. In this paper, we introduce the Odysseus ORDBMS and present its tightly-coupled IR features (US patented). We demonstrate a Web search engine capable of managing 20 million Web pages in a non-parallel configuration using Odysseus.


International Conference on Management of Data | 2014

OPT: a new framework for overlapped and parallel triangulation in large-scale graphs

Jin-Ha Kim; Wook-Shin Han; Sangyeon Lee; Kyungyeol Park; Hwanjo Yu

Graph triangulation, which finds all triangles in a graph, has been actively studied due to its wide range of applications in network analysis and data mining. With the rapid growth of graph data sizes, disk-based triangulation methods are in demand but little researched. To handle a large-scale graph that does not fit in memory, we must iteratively load small parts of the graph. In the existing literature, achieving the ideal cost has been considered impossible for billion-scale graphs due to the memory size constraint. In this paper, we propose an overlapped and parallel disk-based triangulation framework for billion-scale graphs, OPT, which achieves the ideal cost by (1) full overlap of the CPU and I/O operations and (2) full parallelism of multi-core CPU and FlashSSD I/O. In OPT, triangles whose vertices are all in memory are called internal triangles, while triangles spanning vertices in memory and vertices in external memory are called external triangles. At the macro level, OPT overlaps internal triangulation and external triangulation, while at the micro level it overlaps the CPU and I/O operations. Thereby, the cost of OPT is close to the ideal cost. Moreover, OPT instantiates both the vertex-iterator and edge-iterator models and benefits from multi-thread parallelism on both types of triangulation. Extensive experiments conducted on large-scale datasets showed that (1) OPT achieved elapsed times close to those of the ideal method with less than 7% overhead under the limited memory budget, (2) OPT achieved linear speed-up with an increasing number of CPU cores, (3) OPT outperformed the state-of-the-art parallel method by up to an order of magnitude with 6 CPU cores, and (4) for the first time in the literature, triangulation results are reported for a billion-vertex scale real-world graph.
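
An edge-iterator triangulation pass, one of the two models OPT instantiates, looks roughly as follows for the in-memory (internal triangle) case. This is an illustrative sketch, not the OPT code; external triangles would additionally intersect in-memory adjacency lists with lists fetched from the FlashSSD.

    def triangles(adj):
        """Edge-iterator triangle listing; adj: vertex -> set of neighbors."""
        found = []
        for u in adj:
            for v in adj[u]:
                if u < v:  # visit each undirected edge once
                    for w in adj[u] & adj[v]:  # common neighbors close triangles
                        if v < w:  # canonical order u < v < w avoids duplicates
                            found.append((u, v, w))
        return found

    adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
    print(triangles(adj))  # [(1, 2, 3)]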


Distributed and Parallel Databases | 2015

The G* graph database: efficiently managing large distributed dynamic graphs

Alan G. Labouseur; Jeremy Birnbaum; Paul W. Olsen; Sean R. Spillane; Jayadevan Vijayan; Jeong-Hyon Hwang; Wook-Shin Han

From sensor networks to transportation infrastructure to social networks, we are awash in data. Many of these real-world networks are large (“big data”) and dynamic, evolving over time. Their evolution can be modeled as a series of graphs. Traditional systems that store and analyze one graph at a time cannot effectively handle the complexity and subtlety inherent in dynamic graphs. Modern analytics require systems capable of storing and processing series of graphs. We present such a system. G* compresses dynamic graph data based on commonalities among the graphs in the series for deduplicated storage on multiple servers. Beyond the obvious space savings, large-scale graph processing tends to be I/O bound, so faster reads from and writes to stable storage yield faster results. Unlike traditional database and graph processing systems, G* executes complex queries on large graphs using distributed operators to process graph data in parallel. It speeds up queries on multiple graphs by processing graph commonalities only once and sharing the results across the relevant graphs. This architecture not only provides scalability; because G* is not limited to processing what fits in RAM, its analysis capabilities far exceed those of systems limited to what they can hold in memory. This paper presents G*’s design and implementation principles, along with evaluation results that document its unique benefits over traditional graph processing systems.
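
One way to picture the commonality-based deduplication, as a hedged sketch (a toy data model of our own, not G*'s storage format): each distinct vertex state is stored once, tagged with the set of graphs in the series that share it.

    class GraphSeriesStore:
        """Toy deduplicated store for a series of graph snapshots."""

        def __init__(self):
            self.versions = {}  # (vertex, state) -> set of graph ids

        def add(self, graph_id, vertex, state):
            key = (vertex, frozenset(state.items()))
            self.versions.setdefault(key, set()).add(graph_id)

        def lookup(self, graph_id, vertex):
            for (v, state), graphs in self.versions.items():
                if v == vertex and graph_id in graphs:
                    return dict(state)

    store = GraphSeriesStore()
    store.add('G1', 'a', {'deg': 2})
    store.add('G2', 'a', {'deg': 2})  # unchanged state: stored once, shared
    store.add('G3', 'a', {'deg': 3})  # changed state: a second version
    print(store.lookup('G2', 'a'))    # {'deg': 2}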


IEEE Transactions on Knowledge and Data Engineering | 2003

Dynamic buffer allocation in video-on-demand systems

Sang Ho Lee; Kyu-Young Whang; Yang-Sae Moon; Wook-Shin Han; Il-Yeol Song

In video-on-demand (VOD) systems, as the size of the buffer allocated to user requests increases, initial latency and memory requirements increase. Hence, the buffer size must be minimized. The existing static buffer allocation scheme, however, determines the buffer size based on the assumption that the system is in the fully loaded state. Thus, when the system is in a partially loaded state, the scheme allocates a buffer larger than necessary to a user request. This paper proposes a dynamic buffer allocation scheme that allocates buffers of the minimum size to user requests in a partially loaded state as well as in the fully loaded state. The inherent difficulty in determining the buffer size in the dynamic buffer allocation scheme is that the size of the buffer currently being allocated depends on the number and sizes of the buffers to be allocated in the next service period. We solve this problem with the predict-and-enforce strategy, where we predict the number and sizes of future buffers based on inertia assumptions and enforce these assumptions at runtime. Any violation of these assumptions is resolved by deferring service to the violating new user request until the assumptions are satisfied. Since the size of the current buffer depends on the sizes of the future buffers, it is represented by a recurrence equation. We provide a solution to this equation, which can be computed at system initialization time for runtime efficiency. We have performed extensive analysis and simulation. The results show that the dynamic buffer allocation scheme reduces initial latency (averaged over the number of user requests in service, from one to the maximum capacity) to 1/29.4 ∼ 1/11.0 of that for the static scheme and, by reducing the memory requirement, increases the number of concurrent user requests to 2.36 ∼ 3.25 times that of the static scheme when averaged over the amount of system memory available. These results demonstrate that the dynamic buffer allocation scheme significantly improves the performance and capacity of VOD systems.
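
The "enforce" half of predict-and-enforce can be sketched as an admission loop (illustrative only; the paper's actual inertia assumptions and the recurrence for buffer sizes are not reproduced here, and `assumptions_hold` is a hypothetical predicate):

    from collections import deque

    def admission_loop(incoming, assumptions_hold):
        """Admit requests whose predicted buffers remain valid; defer the rest."""
        admitted, deferred = [], deque()
        for request in incoming:
            if assumptions_hold(admitted, request):
                admitted.append(request)   # allocate the precomputed buffer size
            else:
                deferred.append(request)   # serve in a later period, per the paper
        return admitted, deferred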


International Conference on Management of Data | 2007

Progressive optimization in a shared-nothing parallel database

Wook-Shin Han; Jack Hon Wai Ng; Volker Markl; Holger Kache; Mokhtar Kandil

Commercial enterprise data warehouses are typically implemented on parallel databases due to the inherent scalability and performance limitations of a serial architecture. Queries used in such large data warehouses can contain complex predicates as well as multiple joins, and the resulting query execution plans generated by the optimizer may be sub-optimal due to mis-estimates of row cardinalities. Progressive optimization (POP) is an approach that detects cardinality estimation errors by monitoring actual cardinalities at run-time and recovers by triggering re-optimization with the actual cardinalities measured. However, the original POP solution is based on a serial processing architecture, and its core ideas cannot be readily applied to a parallel shared-nothing environment. Extending serial POP to a parallel environment is a challenging problem, since we need to determine when and how to trigger re-optimization based on cardinalities collected from multiple independent nodes. In this paper, we present a comprehensive and practical solution to this problem, including several novel voting schemes for deciding whether to trigger re-optimization, a mechanism to reuse local intermediate results across nodes as a partitioned materialized view, several flavors of parallel checkpoint operators, and parallel checkpoint processing methods using efficient communication protocols. This solution has been prototyped in a leading commercial parallel DBMS. We have performed extensive experiments using the TPC-H benchmark and a real-world database. Experimental results show that our solution has negligible runtime overhead and accelerates the performance of complex OLAP queries by up to a factor of 22.
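
One plausible voting scheme, sketched under our own assumptions (the paper proposes several; the majority rule and 10x error factor here are illustrative, not the paper's parameters): each node compares the actual cardinality observed at a checkpoint against the optimizer's estimate and votes for re-optimization when the error is large.

    def should_reoptimize(node_stats, error_factor=10.0, quorum=0.5):
        """node_stats: list of (estimated, actual) cardinalities, one per node."""
        votes = 0
        for estimated, actual in node_stats:
            ratio = actual / max(estimated, 1e-9)
            if ratio > error_factor or ratio < 1.0 / error_factor:
                votes += 1  # this node's local estimate was badly wrong
        return votes / len(node_stats) > quorum

    # Two of three nodes observed a ~100x under-estimate: trigger re-optimization.
    print(should_reoptimize([(10, 1000), (10, 950), (10, 12)]))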


Very Large Data Bases | 2015

Taming subgraph isomorphism for RDF query processing

Jin-Ha Kim; Hyungyu Shin; Wook-Shin Han; Sungpack Hong; Hassan Chafi

RDF data are used to model knowledge in various areas such as the life sciences, the Semantic Web, bioinformatics, and social graphs. The size of real RDF data reaches billions of triples. This calls for a framework for efficiently processing RDF data. The core function of processing RDF data is subgraph pattern matching. There have been two completely different directions for supporting efficient subgraph pattern matching. One direction is to develop specialized RDF query processing engines exploiting the properties of RDF data, as pursued over the last decade, while the other is to develop efficient subgraph isomorphism algorithms for general, labeled graphs, as pursued for over 30 years. Although both directions have a similar goal (i.e., finding subgraphs in data graphs for a given query graph), they have been researched independently, without a clear reason. We argue that a subgraph isomorphism algorithm can easily be modified to handle graph homomorphism, which is the RDF pattern matching semantics, by simply removing the injectivity constraint. In this paper, based on a state-of-the-art subgraph isomorphism algorithm, we propose an in-memory solution, TurboHOM++, which is tamed for RDF processing, and we compare it with representative RDF processing engines on several RDF benchmarks on a server machine where billions of triples can be loaded in memory. In order to speed up TurboHOM++, we also provide a simple yet effective transformation and a series of optimization techniques. Extensive experiments using several RDF benchmarks show that TurboHOM++ consistently and significantly outperforms the representative RDF engines. Specifically, TurboHOM++ outperforms its competitors by up to five orders of magnitude.
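
The claim that dropping injectivity turns subgraph isomorphism into graph homomorphism can be made concrete with a small backtracking matcher (an illustrative sketch, not the TurboHOM++ code; unlabeled graphs for brevity):

    def match(q_adj, d_adj, injective=True):
        """q_adj, d_adj: vertex -> set of neighbors. Returns all mappings."""
        order, results = list(q_adj), []

        def backtrack(mapping):
            if len(mapping) == len(order):
                results.append(dict(mapping))
                return
            u = order[len(mapping)]
            for v in d_adj:
                if injective and v in mapping.values():
                    continue  # the only difference between the two semantics
                # every mapped neighbor of u must land on a neighbor of v
                if all(d in d_adj[v] for qu, d in mapping.items() if qu in q_adj[u]):
                    mapping[u] = v
                    backtrack(mapping)
                    del mapping[u]

        backtrack({})
        return results

    # A 3-vertex path cannot be embedded injectively into a single edge,
    # but it has two homomorphisms into it (RDF-style semantics).
    path = {0: {1}, 1: {0, 2}, 2: {1}}
    edge = {'a': {'b'}, 'b': {'a'}}
    print(len(match(path, edge)), len(match(path, edge, injective=False)))  # 0 2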

Collaboration


Dive into Wook-Shin Han's collaborations.

Top Co-Authors

Hwanjo Yu, Pohang University of Science and Technology
Jinsoo Lee, Kyungpook National University
Jeong-Hoon Lee, Pohang University of Science and Technology
Yang-Sae Moon, Kangwon National University
Min-Soo Kim, Daegu Gyeongbuk Institute of Science and Technology
Jinoh Oh, Pohang University of Science and Technology