Bolin Ding | Researchain




Publication

Featured research published by Bolin Ding.

international conference on data engineering | 2007

Finding Top-k Min-Cost Connected Trees in Databases

Bolin Ding; J. Xu Yu; Shan Wang; Lu Qin; Xiao Zhang; Xuemin Lin

It is widely realized that the integration of database and information retrieval techniques will provide users with a wide range of high quality services. In this paper, we study processing an l-keyword query, p1, p2, ..., pl, against a relational database which can be modeled as a weighted graph, G(V, E). Here V is a set of nodes (tuples) and E is a set of edges representing foreign key references between tuples. Let Vi ⊆ V be a set of nodes that contain the keyword pi. We study finding top-k minimum cost connected trees that contain at least one node in every subset Vi, and denote our problem as GST-k. When k = 1, it is known as a minimum cost group Steiner tree problem which is NP-complete. We observe that the number of keywords, l, is small, and propose a novel parameterized solution, with l as a parameter, to find the optimal GST-1, in time complexity O(3^l n + 2^l ((l + log n)n + m)), where n and m are the numbers of nodes and edges in graph G. Our solution can handle graphs with a large number of nodes. Our GST-1 solution can be easily extended to support GST-k, which outperforms the existing GST-k solutions over both weighted undirected/directed graphs. We conducted extensive experimental studies, and report our findings.
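The parameterized idea in this abstract can be illustrated with a small dynamic program over (node, keyword subset) states. The sketch below is not taken from the paper: it is a textbook-style best-first search for the GST-1 cost, assuming an undirected graph with non-negative edge weights, and the names `min_group_steiner_cost`, `edges`, and `groups` are hypothetical.

```python
import heapq

def min_group_steiner_cost(n, edges, groups):
    """Cost of a cheapest tree touching at least one node of every group.

    n      -- number of nodes, labeled 0..n-1 (assumed labeling)
    edges  -- list of (u, v, weight), undirected, weight >= 0 (assumption)
    groups -- groups[i] is the set of nodes containing keyword i
    Returns None when no connected tree covers all groups.
    """
    adj = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    full = (1 << len(groups)) - 1
    settled = [dict() for _ in range(n)]  # settled[v][mask] = optimal cost
    heap = []
    for i, group in enumerate(groups):
        for v in group:
            heapq.heappush(heap, (0, v, 1 << i))  # single node covers keyword i

    while heap:
        cost, v, mask = heapq.heappop(heap)
        if mask in settled[v]:
            continue  # stale entry, a cheaper one was settled earlier
        settled[v][mask] = cost
        if mask == full:
            return cost
        # Grow: move the root of the tree across one edge.
        for u, w in adj[v]:
            if mask not in settled[u]:
                heapq.heappush(heap, (cost + w, u, mask))
        # Merge: join two trees rooted at v that cover disjoint keyword sets.
        for other, other_cost in list(settled[v].items()):
            if other & mask == 0 and (other | mask) not in settled[v]:
                heapq.heappush(heap, (cost + other_cost, v, other | mask))
    return None
```

The state space has at most 2^l masks per node, which is why a small number of keywords l keeps such a search practical even when n is large.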

very large data bases | 2010

Swarm: mining relaxed temporal moving object clusters

Zhenhui Li; Bolin Ding; Jiawei Han; Roland Kays

Recent improvements in positioning technology make massive moving object data widely available. One important analysis is to find the moving objects that travel together. Existing methods impose a strong constraint when defining a moving object cluster: they require the moving objects to stick together for consecutive timestamps. Our key observation is that the moving objects in a cluster may actually diverge temporarily and congregate at certain timestamps. Motivated by this, we propose the concept of swarm, which captures the moving objects that move within clusters of arbitrary shape for certain timestamps that are possibly non-consecutive. The goal of our paper is to find all discriminative swarms, namely closed swarms. While the search space for closed swarms is prohibitively huge, we design a method, ObjectGrowth, to efficiently retrieve the answer. In ObjectGrowth, two effective pruning strategies are proposed to greatly reduce the search space and a novel closure checking rule is developed to report closed swarms on-the-fly. Empirical studies on real data as well as large synthetic data demonstrate the effectiveness and efficiency of our methods.
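To make the relaxed temporal constraint concrete, the following sketch checks whether a set of objects forms a swarm. It assumes per-timestamp cluster labels are already available, and that a swarm needs at least `min_objects` objects sharing a cluster for at least `min_timestamps` timestamps. Both thresholds and all function names are hypothetical, and this is only a membership check, not the ObjectGrowth search.

```python
def swarm_timestamps(objects, clusters_by_time):
    """Timestamps at which every object in `objects` lies in the same cluster.

    clusters_by_time[t] maps object id -> cluster label at timestamp t;
    objects missing from the map are treated as unclustered at t.
    """
    shared = []
    for t, label_of in enumerate(clusters_by_time):
        labels = {label_of.get(obj) for obj in objects}
        if len(labels) == 1 and None not in labels:
            shared.append(t)
    return shared

def is_swarm(objects, clusters_by_time, min_objects, min_timestamps):
    """True if the objects travel together often enough, not necessarily
    at consecutive timestamps."""
    together = swarm_timestamps(objects, clusters_by_time)
    return len(objects) >= min_objects and len(together) >= min_timestamps
```

Because the shared timestamps are collected as a set rather than a run, two objects that separate for one timestamp and then rejoin still count as travelling together.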

knowledge discovery and data mining | 2010

Mining periodic behaviors for moving objects

Zhenhui Li; Bolin Ding; Jiawei Han; Roland Kays; Peter Nye

Periodicity is a frequent phenomenon for moving objects. Finding periodic behaviors is essential to understanding object movements. However, periodic behaviors could be complicated, involving multiple interleaving periods, partial time spans, and spatiotemporal noise and outliers. In this paper, we address the problem of mining periodic behaviors for moving objects. It involves two sub-problems: how to detect the periods in complex movement, and how to mine periodic movement behaviors. Our main assumption is that the observed movement is generated from multiple interleaved periodic behaviors associated with certain reference locations. Based on this assumption, we propose a two-stage algorithm, Periodica, to solve the problem. At the first stage, the notion of observation spot is proposed to capture the reference locations. Through observation spots, multiple periods in the movement can be retrieved using a method that combines Fourier transform and autocorrelation. At the second stage, a probabilistic model is proposed to characterize the periodic behaviors. For a specific period, periodic behaviors are statistically generalized from partial movement sequences through hierarchical clustering. Empirical studies on both synthetic and real data sets demonstrate the effectiveness of our method.
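One common way to combine a Fourier transform with autocorrelation is to take the strongest periodogram frequency as a coarse candidate and then pick the autocorrelation peak among the lags that frequency could correspond to. The sketch below follows that heuristic; it is not necessarily the exact procedure used in Periodica. It assumes the movement has already been reduced to a binary sequence (1 when the object is inside an observation spot, 0 otherwise), and the function names are hypothetical.

```python
import cmath

def periodogram(x):
    """(power, k) pairs for frequencies k = 1..n/2, via a naive O(n^2) DFT."""
    n = len(x)
    mean = sum(x) / n
    centered = [xi - mean for xi in x]  # drop the constant component
    pairs = []
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * cmath.pi * k * i / n)
                    for i, c in enumerate(centered))
        pairs.append((abs(coeff) ** 2, k))
    return pairs

def autocorr(x, lag):
    """Unnormalized autocorrelation of x at the given lag."""
    return sum(x[i] * x[i + lag] for i in range(len(x) - lag))

def detect_period(x):
    """Coarse candidate from the periodogram, refined by autocorrelation."""
    n = len(x)
    _, k = max(periodogram(x))                # dominant frequency index
    lo = max(2, n // (k + 1))                 # frequency k covers periods
    hi = min(n // 2, -(-n // max(k - 1, 1)))  # roughly between n/(k+1) and n/(k-1)
    return max(range(lo, hi + 1), key=lambda lag: autocorr(x, lag))
```

For a sequence such as `[1 if i % 24 < 8 else 0 for i in range(240)]`, the periodogram peaks at k = 10 and the autocorrelation step settles on a lag of 24.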

extending database technology | 2008

Finding time-dependent shortest paths over large graphs

Bolin Ding; Jeffrey Xu Yu; Lu Qin

Spatial and temporal databases have been studied widely and intensively over the years. In this paper, we study how to answer queries of finding the best departure time that minimizes the total travel time from one place to another, over a road network, where the traffic conditions dynamically change from time to time. We study a generalized form of this problem, called the time-dependent shortest-path problem. A time-dependent graph G_T is a graph that has an edge-delay function, w_{i,j}(t), associated with each edge (v_i, v_j), to be stored in a database. The edge-delay function w_{i,j}(t) specifies how much time it takes to travel from node v_i to node v_j, if it departs from v_i at time t. A user-specified query is to ask the minimum-travel-time path, from a source node, v_s, to a destination node, v_e, over the time-dependent graph, G_T, with the best departure time to be selected from a time interval T. We denote this user query as LTT(v_s, v_e, T) over G_T. The challenge of this problem is the added complexity due to the time dependency in the time-dependent graph. That is, edge delays are not constants, and can vary from time to time. In this paper, we propose a novel algorithm to find the minimum-travel-time path with the best departure time for an LTT(v_s, v_e, T) query over a large graph G_T. Our approach outperforms existing algorithms in terms of both time complexity in theory and efficiency in practice. We will discuss the design of our algorithm, together with its correctness and complexity. We conducted extensive experimental studies over large graphs and will report our findings.
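The query described here can be illustrated with a simple baseline, which is not the algorithm proposed in the paper: run a time-dependent Dijkstra search for each candidate departure time and keep the one with the smallest travel time. The sketch assumes FIFO edge delays (leaving later never means arriving earlier), no waiting at nodes, and a finite list of candidate departure times drawn from the interval T. All names are hypothetical.

```python
import heapq

def earliest_arrival(graph, source, target, depart):
    """Earliest arrival time at `target` when leaving `source` at `depart`.

    graph[u] is a list of (v, delay_fn) pairs, where delay_fn(t) is the
    non-negative travel time on edge (u, v) when departing u at time t.
    Assumes FIFO delays. Returns None if target is unreachable.
    """
    arrival = {source: depart}
    heap = [(depart, source)]
    done = set()
    while heap:
        t, u = heapq.heappop(heap)
        if u in done:
            continue
        done.add(u)
        if u == target:
            return t
        for v, delay_fn in graph.get(u, []):
            t_v = t + delay_fn(t)  # delay depends on when we leave u
            if t_v < arrival.get(v, float("inf")):
                arrival[v] = t_v
                heapq.heappush(heap, (t_v, v))
    return None

def best_departure(graph, source, target, candidate_times):
    """(departure time, travel time) minimizing travel time, or None."""
    best = None
    for t0 in candidate_times:
        arrived = earliest_arrival(graph, source, target, t0)
        if arrived is not None and (best is None or arrived - t0 < best[1]):
            best = (t0, arrived - t0)
    return best
```

Scanning candidate times one by one costs a full search per candidate, which is the kind of overhead a dedicated algorithm for the whole interval T aims to avoid.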

international conference on data engineering | 2008

Fast Graph Pattern Matching

Jiefeng Cheng; Jeffrey Xu Yu; Bolin Ding; Philip S. Yu; Haixun Wang

Due to the rapid growth of Internet technology and new scientific/technological advances, the number of applications that model data as graphs increases, because graphs have high expressive power to model complicated structures. The dominance of graphs in real-world applications calls for new graph data management so that users can access graph data effectively and efficiently. In this paper, we study a graph pattern matching problem over a large data graph. The problem is to find all patterns in a large data graph that match a user-given graph pattern. We propose a new two-step R-join (reachability join) algorithm with a filter step and a fetch step based on a cluster-based join-index with graph codes. We consider the filter step as an R-semijoin, and propose a new optimization approach by interleaving R-joins with R-semijoins. We conducted extensive performance studies, and confirm the efficiency of our proposed new approaches.

very large data bases | 2011

Distance-constraint reachability computation in uncertain graphs

Ruoming Jin; Lin Liu; Bolin Ding; Haixun Wang

Driven by the emerging network applications, querying and mining uncertain graphs has become increasingly important. In this paper, we investigate a fundamental problem concerning uncertain graphs, which we call the distance-constraint reachability (DCR) problem: Given two vertices s and t, what is the probability that the distance from s to t is less than or equal to a user-defined threshold d in the uncertain graph? Since this problem is #P-complete, we focus on efficiently and accurately approximating DCR online. Our main results include two new estimators for the probabilistic reachability. One is a Horvitz-Thompson type estimator based on the unequal probabilistic sampling scheme, and the other is a novel recursive sampling estimator, which effectively combines a deterministic recursive computational procedure with a sampling process to boost the estimation accuracy. Both estimators can produce much smaller variance than the direct sampling estimator, which considers each trial to be either 1 or 0. We also present methods to make these estimators more computationally efficient. A comprehensive experimental evaluation on both real and synthetic datasets demonstrates the efficiency and accuracy of our new estimators.
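The DCR probability and the direct sampling baseline mentioned in this abstract can be sketched as follows. This is only the baseline, not either of the paper's new estimators. The sketch assumes a directed graph whose edges exist independently with the given probabilities and have unit length, so distance is a hop count. The exact routine enumerates every possible world and is only feasible for a handful of edges. All names are hypothetical.

```python
import random
from collections import deque
from itertools import product

def within_distance(edges, s, t, d):
    """True if t is at most d hops from s in one sampled world (BFS)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        if u == t:
            return dist[u] <= d
        for v in adj.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return False

def dcr_exact(prob_edges, s, t, d):
    """Sum the probabilities of all possible worlds that satisfy the constraint.

    prob_edges is a list of (u, v, probability); cost is 2^len(prob_edges).
    """
    total = 0.0
    for keep in product([False, True], repeat=len(prob_edges)):
        p = 1.0
        world = []
        for (u, v, q), present in zip(prob_edges, keep):
            p *= q if present else 1.0 - q
            if present:
                world.append((u, v))
        if within_distance(world, s, t, d):
            total += p
    return total

def dcr_direct_sampling(prob_edges, s, t, d, trials, seed=0):
    """Direct sampling: each trial draws one world and scores it 1 or 0."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        world = [(u, v) for u, v, q in prob_edges if rng.random() < q]
        hits += within_distance(world, s, t, d)
    return hits / trials
```

For edges s→a (0.5), a→t (0.5), and s→t (0.2) with d = 2, the exact probability is 1 − 0.8 × 0.75 = 0.4, and the sampling estimate approaches that value as the number of trials grows.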

international conference on data mining | 2008

Text Cube: Computing IR Measures for Multidimensional Text Database Analysis

Cindy Xinde Lin; Bolin Ding; Jiawei Han; Feida Zhu; Bo Zhao

Since Jim Gray introduced the concept of "data cube" in 1997, data cube, associated with online analytical processing (OLAP), has become a driving engine in the data warehouse industry. Because the boom of the Internet has given rise to an ever increasing amount of text data associated with other multidimensional information, it is natural to propose a data cube model that integrates the power of traditional OLAP and IR techniques for text. In this paper, we propose a text-cube model on multidimensional text database and study effective OLAP over such data. Two kinds of hierarchies are distinguishable inside: dimensional hierarchy and term hierarchy. By incorporating these hierarchies, we conduct systematic studies on efficient text-cube implementation, OLAP execution and query processing. Our performance study shows the high promise of our methods.

database systems for advanced applications | 2007

Twiglist: make twig pattern matching fast

Lu Qin; Jeffrey Xu Yu; Bolin Ding

The twig pattern matching problem has been widely studied in recent years. Given an XML tree τ, a twig-pattern matching query, Q, represented as a query tree, is to find all the occurrences of such a twig pattern in τ. Previous works like HolisticTwig and TJFast decomposed the twig pattern into single paths from root to leaves, and merged all the occurrences of such path-patterns to find the occurrences of the twig-pattern matching query, Q. Their techniques can effectively prune impossible path-patterns to avoid producing a large amount of intermediate results. But they still need to merge path-patterns, which incurs high computational cost. Recently, Twig2Stack was proposed to overcome this problem using hierarchical-stacks to further reduce the merging cost. But, due to the complex hierarchical-stacks it uses, Twig2Stack may end up with many random accesses in memory, and may need to load the whole XML tree into memory in the worst case. In this paper, we propose a new algorithm, called TwigList, which uses simple lists. Both time and space complexity of our algorithm are linear with respect to the total number of pattern occurrences and the size of the XML tree. In addition, our algorithm can be easily modified as an external algorithm. We conducted extensive experimental studies using large benchmark and real datasets. Our algorithm significantly outperforms the up-to-date algorithm.

international conference on data engineering | 2009

Efficient Mining of Closed Repetitive Gapped Subsequences from a Sequence Database

Bolin Ding; David Lo; Jiawei Han; Siau-Cheng Khoo

There is a huge wealth of sequence data available, for example, customer purchase histories, program execution traces, DNA, and protein sequences. Analyzing this wealth of data to mine important knowledge is certainly a worthwhile goal. In this paper, as a step forward to analyzing patterns in sequences, we introduce the problem of mining closed repetitive gapped subsequences and propose efficient solutions. Given a database of sequences where each sequence is an ordered list of events, the pattern we would like to mine is called repetitive gapped subsequence, which is a subsequence (possibly with gaps between two successive events within it) of some sequences in the database. We introduce the concept of repetitive support to measure how frequently a pattern repeats in the database. Different from the sequential pattern mining problem, repetitive support captures not only repetitions of a pattern in different sequences but also the repetitions within a sequence. Given a user-specified support threshold min_sup, we study finding the set of all patterns with repetitive support no less than min_sup. To obtain a compact yet complete result set and improve the efficiency, we also study finding closed patterns. Efficient mining algorithms to find the complete set of desired patterns are proposed based on the idea of instance growth. Our performance study on various datasets shows the efficiency of our approach. A case study is also performed to show the utility of our approach.

IEEE Internet Computing | 2008

Classifying Data Streams with Skewed Class Distributions and Concept Drifts

Jing Gao; Bolin Ding; Wei Fan; Jiawei Han; Philip S. Yu

Classification is an important data analysis tool that uses a model built from historical data to predict class labels for new observations. More and more applications are featuring data streams, rather than finite stored data sets, which are a challenge for traditional classification algorithms. Concept drifts and skewed distributions, two common properties of data stream applications, make the task of learning in streams difficult. The authors aim to develop a new approach to classify skewed data streams that uses an ensemble of models to match the distribution over under-samples of negatives and repeated samples of positives.
