Featured Researches

Databases

Efficient Discovery of Approximate Order Dependencies

Order dependencies (ODs) capture relationships between ordered domains of attributes. Approximate ODs (AODs) capture such relationships even when there exist exceptions in the data. During automated discovery of ODs, validation is the process of verifying whether an OD holds. We present an algorithm for validating approximate ODs with significantly improved runtime performance over existing methods for AODs, and prove that it is correct and has optimal runtime. By replacing the validation step in a leading algorithm for approximate OD discovery with ours, we achieve orders-of-magnitude improvements in performance.
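As a toy illustration of the underlying notion (not the paper's validation algorithm): an OD A ↦ B says that sorting a relation by A also sorts it by B, and an approximate OD tolerates a bounded number of violations. A minimal sketch, with hypothetical attribute names:

```python
def od_violations(rows, lhs, rhs):
    """Count adjacent-pair violations of the OD lhs -> rhs: sorting by lhs
    should also sort rhs. This is an illustrative simplification; the
    paper's validation algorithm is far more sophisticated."""
    ordered = sorted(rows, key=lambda r: r[lhs])
    return sum(1 for a, b in zip(ordered, ordered[1:]) if a[rhs] > b[rhs])

rows = [{"date": 1, "price": 10}, {"date": 2, "price": 12}, {"date": 3, "price": 11}]
# one adjacent pair breaks the order, so date -> price holds only approximately
assert od_violations(rows, "date", "price") == 1
```

A zero count would mean the OD holds exactly; an AOD discovery algorithm compares this count against an error threshold.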

Databases

Efficient Mining of Frequent Subgraphs with Two-Vertex Exploration

Frequent Subgraph Mining (FSM) is a key task in many graph mining and machine learning applications. Numerous systems have been proposed for FSM in the past decade. Although these systems show good performance for small patterns (with no more than four vertices), we found that they have difficulty mining larger patterns. In this work, we propose a novel two-vertex exploration strategy to accelerate the mining process. Compared with the single-vertex exploration adopted by previous systems, our two-vertex exploration avoids the large memory consumption issue and significantly reduces memory access overhead. We further enhance performance through an index-based quick pattern technique that reduces the overhead of isomorphism checks, and a subgraph sampling technique that mitigates the issue of subgraph explosion. Experimental results show that our system achieves significant speedups over state-of-the-art graph pattern mining systems and supports larger pattern mining tasks that none of the existing systems can handle.
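For intuition only, the single-edge base case of FSM can be sketched as counting canonical labeled-edge patterns, a baby version of the quick-pattern idea of normalizing subgraphs before expensive isomorphism checks (labels and the support threshold below are made up):

```python
from collections import Counter

def canonical_edge(lu, lv):
    # canonical key so that (A, B) and (B, A) map to the same pattern
    return tuple(sorted((lu, lv)))

def frequent_edge_patterns(edges, labels, min_support):
    # count each labeled-edge pattern and keep those above the threshold
    counts = Counter(canonical_edge(labels[u], labels[v]) for u, v in edges)
    return {p for p, c in counts.items() if c >= min_support}
```

Real FSM systems extend such patterns vertex by vertex (or, per this paper, two vertices at a time), which is where the memory and isomorphism-check costs arise.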

Databases

Efficient Oblivious Database Joins

A major algorithmic challenge in designing applications intended for secure remote execution is ensuring that they are oblivious to their inputs, in the sense that their memory access patterns do not leak sensitive information to the server. This problem is particularly relevant to cloud databases that wish to allow queries over the client's encrypted data. One of the major obstacles to this goal is the join operator, which is non-trivial to implement obliviously without resorting to generic but inefficient solutions like Oblivious RAM (ORAM). We present an oblivious algorithm for equi-joins which (up to a logarithmic factor) matches the optimal O(n log n) complexity of the standard non-secure sort-merge join (on inputs producing O(n) outputs). We do not use expensive primitives like ORAM or rely on unrealistic hardware or security assumptions. Our approach, which is based on sorting networks and novel provably-oblivious constructions, is conceptually simple, easily verifiable, and very efficient in practice. Its data-independent algorithmic structure makes it secure in various settings for remote computation, even those known to be vulnerable to certain side-channel attacks (such as Intel SGX) or with strict requirements for low circuit complexity (such as secure multiparty computation). We confirm that our approach is easily realizable through a compact implementation which matches our expectations for performance and is shown, both formally and empirically, to possess the desired security characteristics.
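To illustrate the data-independence property sorting networks provide (this is odd-even transposition sort, a textbook network, not the paper's construction): every compare-exchange touches fixed positions determined only by the input length, so the memory access pattern reveals nothing about the values.

```python
def compare_exchange(a, i, j):
    # always reads and writes both slots, whatever the values are
    a[i], a[j] = min(a[i], a[j]), max(a[i], a[j])

def oblivious_sort(a):
    # odd-even transposition sorting network: the sequence of (i, i+1)
    # pairs depends only on len(a), never on the data itself
    n = len(a)
    for rnd in range(n):
        for i in range(rnd % 2, n - 1, 2):
            compare_exchange(a, i, i + 1)
    return a
```

An observer who logs the indices accessed learns only n, which is what makes network-based building blocks attractive both under side channels and in circuit-based secure computation.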

Databases

Efficient Radial Pattern Keyword Search on Knowledge Graphs in Parallel

Recently, keyword search on Knowledge Graphs (KGs) has become popular. Typical keyword search approaches aim to find a concise subgraph of a KG that reflects a close relationship among all the input keywords. The connection paths between keywords are selected so as to produce a result subgraph with a better semantic score. However, such a result may not meet the user's information need, because it relies on the scoring function to decide which keywords to link more closely, and it may therefore miss close connections among the keywords on which the user intends to focus. In this paper, we propose a parallel keyword search engine called RAKS. It allows users to specify a query as two sets of keywords: central keywords and marginal keywords. Central keywords are the keywords on which users focus more; their relationships to one another are desired in the results. Marginal keywords are the less-focused keywords; their connections to the central keywords are desired, and they provide additional information that helps discover better results in terms of user intent. To improve efficiency, we propose novel weighting and scoring schemes that boost parallel execution during search while retrieving semantically relevant results. We conduct extensive experiments to validate that RAKS works efficiently and effectively on large, diverse open KGs.
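A hedged sketch of the query semantics only (not the RAKS engine or its scoring scheme): a candidate answer root can be scored by graph distance to the keyword nodes, with central keywords weighted more heavily than marginal ones. The BFS and the weights below are illustrative assumptions.

```python
from collections import deque

def bfs_dist(adj, src):
    # unweighted shortest-path distances from src
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def score_root(adj, root, central, marginal, w_central=2.0, w_marginal=1.0):
    # smaller is better: roots close to central keywords are preferred
    # (the weights are made-up parameters, not the paper's)
    d = bfs_dist(adj, root)
    inf = float("inf")
    return (w_central * sum(d.get(k, inf) for k in central)
            + w_marginal * sum(d.get(k, inf) for k in marginal))
```

The point of the two keyword sets is visible even in this toy: two roots with the same total distance rank differently once central keywords count double.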

Databases

Efficient Semi-External Depth-First Search

Computing Depth-First Search (DFS) results, i.e., a depth-first order or a DFS-tree, in a semi-external environment has become a hot topic, because graphs in the big data era are growing so rapidly that they can hardly be held in main memory. Existing semi-external DFS algorithms assume that main memory can, at least, hold a spanning tree T of a graph G, and they gradually restructure T into a DFS-tree, which is non-trivial. In this paper, we present a comprehensive study of the semi-external DFS problem, including, to the best of our knowledge, the first theoretical analysis of its main challenge. We also introduce a new semi-external DFS algorithm with an efficient edge-pruning principle, named EP-DFS. Unlike traditional algorithms, we focus not only on solving this complex problem efficiently with fewer I/Os, but also on doing so with simpler CPU computation (implementation-friendly) and less random I/O access (key to efficiency). The former is based on our efficient pruning principle; the latter is addressed by a lightweight index, the N+-index, which is a compressed storage for a subset of the edges of G. Extensive experimental evaluation on both synthetic and real graphs confirms that our EP-DFS algorithm outperforms existing techniques.
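For reference, the in-memory computation that semi-external algorithms must reproduce under tight memory is simply a depth-first order. The sketch below assumes a tree-shaped adjacency structure; EP-DFS itself must achieve the same result I/O-efficiently on general disk-resident graphs.

```python
def dfs_order(adj, root):
    # iterative in-memory DFS; a semi-external algorithm must emit the
    # same depth-first order while holding only O(V) state in memory
    order, seen, stack = [], {root}, [root]
    while stack:
        u = stack.pop()
        order.append(u)
        # reversed so children are visited in their listed order
        for v in reversed(adj.get(u, [])):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return order
```

The difficulty the paper addresses is that edges, unlike this toy's adjacency lists, live on disk, so naive traversal incurs one random I/O per edge.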

Databases

Efficient Suspected Infected Crowds Detection Based on Spatio-Temporal Trajectories

Person-to-person virus transmission is an emergency facing the global public. Early detection and isolation of potentially susceptible crowds can effectively control the spread of the disease. Existing metrics cannot correctly capture the infection rate along trajectories. To solve this problem, we propose a novel spatio-temporal infected rate (IR) measure based on human movement trajectories that adequately describes the risk of being infected by a given query trajectory of a patient. We then manage the source data through an efficient spatio-temporal index, making our system more scalable and able to quickly query susceptible crowds from massive trajectories. We also design several pruning strategies that effectively reduce computation. Further, we design a spatial first time (SFT) index, which enables us to quickly query multiple trajectories without much I/O consumption or data redundancy. The performance of our solutions is demonstrated in experiments on real and synthetic trajectory datasets, which show their effectiveness and efficiency.
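As a rough illustration of the idea (the paper's IR measure, index, and pruning strategies are more elaborate), a contact score between a patient trajectory and a candidate trajectory can be derived from spatio-temporal proximity. The radius and time window below are made-up parameters.

```python
def contact_score(traj_a, traj_b, radius, window):
    """Fraction of points of traj_a that come within `radius` (space)
    and `window` (time) of some point of traj_b. Points are (x, y, t)
    tuples; this is an illustrative stand-in for the IR measure."""
    def near(p, q):
        (xa, ya, ta), (xb, yb, tb) = p, q
        return (abs(ta - tb) <= window
                and (xa - xb) ** 2 + (ya - yb) ** 2 <= radius ** 2)
    hits = sum(1 for p in traj_a if any(near(p, q) for q in traj_b))
    return hits / len(traj_a)
```

A system at scale cannot afford this all-pairs check, which is exactly why the paper introduces spatio-temporal indexing and pruning.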

Databases

Efficient Trajectory Compression and Queries

GPS sensors are now ubiquitous in a variety of devices, collecting, storing, and transmitting tremendous amounts of trajectory data. This unprecedented scale of GPS data has created an urgent demand for not only an effective storage mechanism but also an efficient query mechanism. Online line simplification, a trajectory compression method commonly used in practice, plays an important role in addressing this issue. In this paper, we regard each compressed trajectory as a sequence of continuous line segments rather than discrete points. Based on this view, we propose a new trajectory similarity metric AL, an efficient index ASP-tree, and two algorithms for processing range queries and top-k similarity queries on compressed trajectories.
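For intuition, here is a minimal online line-simplification pass in the opening-window style (a generic sketch, not the paper's AL metric or ASP-tree): a point can be dropped as long as every dropped point stays within a tolerance of the current anchor-to-candidate segment.

```python
def point_segment_dist(p, a, b):
    # Euclidean distance from point p to segment a-b
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)))
    cx, cy = ax + t * dx, ay + t * dy
    return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

def simplify_online(points, eps):
    # one-pass simplification: grow the window while all skipped points
    # stay within eps of the tentative segment, else emit a new anchor
    kept, anchor, pending = [points[0]], points[0], []
    for p in points[1:]:
        if all(point_segment_dist(q, anchor, p) <= eps for q in pending):
            pending.append(p)
        else:
            kept.append(pending[-1])
            anchor, pending = pending[-1], [p]
    kept.append(points[-1])
    return kept
```

The output is naturally a chain of line segments rather than raw points, which matches the paper's view of compressed trajectories.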

Databases

Efficient and Effective ER with Progressive Blocking

Blocking is a mechanism to improve the efficiency of Entity Resolution (ER); it aims to quickly prune out all non-matching record pairs. However, depending on the distribution of entity cluster sizes, existing techniques can be either (a) too aggressive, such that they help scale but can adversely affect ER effectiveness, or (b) too permissive, potentially harming ER efficiency. In this paper, we propose a new methodology of progressive blocking (pBlocking) to enable both efficient and effective ER, which works seamlessly across different entity cluster size distributions. pBlocking is based on the insight that the effectiveness-efficiency trade-off is revealed only once the output of ER starts to become available. Hence, pBlocking leverages partial ER output in a feedback loop to refine the blocking result in a data-driven fashion. Specifically, we bootstrap pBlocking with traditional blocking methods and progressively improve the building and scoring of blocks until we reach the desired trade-off, leveraging a limited amount of ER results as guidance at every round. We formally prove that pBlocking converges efficiently (O(n log^2 n) time complexity, where n is the total number of records). Our experiments show that incorporating partial ER output in a feedback loop can improve the efficiency and effectiveness of blocking by 5x and 60% respectively, improving the overall F-score of the entire ER process by up to 60%.
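The feedback-loop idea can be caricatured in a few lines. In this sketch (the block scoring, the rounds, and the sample size are invented simplifications of pBlocking), a small sample of pairs per block is resolved first, and only blocks whose sample yields a match are fully resolved.

```python
from collections import defaultdict
from itertools import combinations

def progressive_blocking(records, key, match, sample=2):
    """Toy pBlocking-style sketch: sample a few pairs per block, use the
    partial ER results as a feedback signal, and fully resolve only the
    blocks whose sample produced at least one match."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    matches = []
    for grp in blocks.values():
        pairs = list(combinations(grp, 2))
        if any(match(a, b) for a, b in pairs[:sample]):  # feedback signal
            matches += [(a, b) for a, b in pairs if match(a, b)]
    return matches
```

The real system iterates this loop, re-building and re-scoring blocks each round, rather than making a single sample-then-resolve pass.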

Databases

Efficient and Effective Similar Subtrajectory Search with Deep Reinforcement Learning

Similar trajectory search is a fundamental problem and has been well studied over the past two decades. However, the similar subtrajectory search (SimSub) problem, which aims to return the portion of a trajectory (i.e., a subtrajectory) that is most similar to a query trajectory, has been mostly disregarded, even though it captures trajectory similarity in a finer-grained way and many applications take subtrajectories as basic units for analysis. In this paper, we study the SimSub problem and develop a suite of algorithms, including both exact and approximate ones. Among the approximate algorithms, two that are based on deep reinforcement learning stand out, outperforming the non-learning-based algorithms in terms of both effectiveness and efficiency. We conduct experiments on real-world trajectory datasets, which verify the effectiveness and efficiency of the proposed algorithms.
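The exact baseline that learned methods compete with is easy to state: scan all O(n^2) contiguous subtrajectories and keep the one closest to the query. A sketch with a toy one-dimensional distance (the paper works with real trajectory similarity measures):

```python
def best_subtrajectory(traj, query, dist):
    # exhaustive exact search over all contiguous subtrajectories;
    # the learning-based methods aim to avoid this quadratic scan
    best, best_d = None, float("inf")
    for i in range(len(traj)):
        for j in range(i + 1, len(traj) + 1):
            d = dist(traj[i:j], query)
            if d < best_d:
                best, best_d = traj[i:j], d
    return best

def toy_dist(sub, q):
    # made-up similarity for illustration: length mismatch plus pointwise gap
    return abs(len(sub) - len(q)) + sum(abs(a - b) for a, b in zip(sub, q))
```

With a costly similarity measure this scan becomes prohibitive on long trajectories, which motivates the approximate, reinforcement-learning-based splitting strategies.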

Databases

Efficiently Finding a Maximal Clique Summary via Effective Sampling

Maximal clique enumeration (MCE) is a fundamental problem in graph theory and is used in many applications, such as social network analysis, bioinformatics, intelligent agent systems, and cyber security. Most existing MCE algorithms focus on improving efficiency rather than reducing the output size, and the output can unfortunately consist of a large number of maximal cliques. In this paper, we study how to report a summary of less-overlapping maximal cliques. The problem has been studied before; however, after examining the pioneering approach, we find it still unsatisfactory. To advance the research along this line, our paper makes four contributions: (a) we propose a more effective sampling strategy, which produces a much smaller summary but still ensures that the summary witnesses all the maximal cliques and that the expectation of each maximal clique being witnessed by the summary is above a predefined threshold; (b) we prove that the sampling strategy is optimal under certain optimality conditions; (c) we apply clique-size bounding and design a new enumeration order to approach these optimality conditions; and (d) for experimental verification, we test on eight real benchmark datasets with a variety of graph characteristics. The results show that our new sampling strategy consistently outperforms the state-of-the-art approach, producing smaller summaries and running faster on all the datasets.
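For context, the unsummarized output comes from classic maximal clique enumeration, e.g. Bron-Kerbosch (shown below without pivoting). The paper's contribution is sampling a small, low-overlap subset of this output, which the sketch does not attempt.

```python
def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Enumerate all maximal cliques of a graph given as {vertex: set of
    neighbors}. On dense graphs this output is exactly the explosion a
    clique summary is meant to avoid."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r  # r is maximal: nothing can extend it, nothing supersedes it
        return
    for v in sorted(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p -= {v}
        x |= {v}
```

Usage: on a triangle {0,1,2} with a pendant vertex 3 attached to 2, the enumeration yields the two maximal cliques {0,1,2} and {2,3}.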

