Shaoxu Song | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Shaoxu Song is active.

Explore More

Publication

Featured researches published by Shaoxu Song.

ACM Transactions on Database Systems | 2011

Differential dependencies: Reasoning and discovery

Shaoxu Song; Lei Chen

The importance of difference semantics (e.g., “similar” or “dissimilar”) has been recently recognized for declaring dependencies among various types of data, such as numerical values or text values. We propose a novel form of Differential Dependencies (dds), which specifies constraints on difference, called differential functions, instead of identification functions in traditional dependency notations like functional dependencies. Informally, a differential dependency states that if two tuples have distances on attributes X agreeing with a certain differential function, then their distances on attributes Y should also agree with the corresponding differential function on Y. For example, [date(≤ 7)]→[price(< 100)] states that the price difference of any two days within a week length should be no greater than 100 dollars. Such differential dependencies are useful in various applications, for example, violation detection, data partition, query optimization, record linkage, etc. In this article, we first address several theoretical issues of differential dependencies, including formal definitions of dds and differential keys, subsumption order relation of differential functions, implication of dds, closure of a differential function, a sound and complete inference system, and minimal cover for dds. Then, we investigate a practical problem, that is, how to discover dds and differential keys from a given dataset. Due to the intrinsic hardness, we develop several pruning methods to improve the discovery efficiency in practice. Finally, through an extensive experimental evaluation on real datasets, we demonstrate the discovery performance and the effectiveness of dds in several real applications.

conference on information and knowledge management | 2009

Discovering matching dependencies

Shaoxu Song; Lei Chen

Matching dependencies (MDs) are recently proposed for various data quality applications such as detecting the violation of integrity constraints and duplicate object identification. In this paper, we study the problem of discovering matching dependencies for a given database instance. First, we formally define the measures, support and confidence, for evaluating the utility of MDs in the given database instance. Then, we study the discovery of MDs with certain utility requirements of support and confidence. Exact algorithms are developed, together with pruning strategies to improve the time performance. Finally, our experimental evaluation demonstrates the efficiency of the proposed methods.

very large data bases | 2013

Efficient recovery of missing events

Jianmin Wang; Shaoxu Song; Xiaochen Zhu; Xuemin Lin

For various entering and transmission issues raised by human or system, missing events often occur in event data, which record execution logs of business processes. Without recovering the missing events, applications such as provenance analysis or complex event processing built upon event data are not reliable. Following the minimum change discipline in improving data quality, it is also rational to find a recovery that minimally differs from the original data. Existing recovery approaches fall short of efficiency owing to enumerating and searching over all of the possible sequences of events. In this paper, we study the efficient techniques for recovering missing events. According to our theoretical results, the recovery problem appears to be NP-hard. Nevertheless, advanced indexing, pruning techniques are developed to further improve the recovery efficiency. The experimental results demonstrate that our minimum recovery approach achieves high accuracy, and significantly outperforms the state-of-the-art technique for up to five orders of magnitudes improvement in time performance.

very large data bases | 2014

Repairing vertex labels under neighborhood constraints

Shaoxu Song; Hong Cheng; Jeffrey Xu Yu; Lei Chen

A broad class of data, ranging from similarity networks, workflow networks to protein networks, can be modeled as graphs with data values as vertex labels. The vertex labels (data values) are often dirty for various reasons such as typos or erroneous reporting of results in scientific experiments. Neighborhood constraints, specifying label pairs that are allowed to appear on adjacent vertexes in the graph, are employed to detect and repair erroneous vertex labels. In this paper, we study the problem of repairing vertex labels to make graphs satisfy neighborhood constraints. Unfortunately, the relabeling problem is proved to be NP hard, which motivates us to devise approximation methods for repairing, and identify interesting special cases (star and clique constraints) that can be efficiently solved. We propose several approximate repairing algorithms including greedy heuristics, contraction method and a hybrid approach. The performances of algorithms are also analyzed for the special case. Our extensive experimental evaluation, on both synthetic and real data, demonstrates the effectiveness of eliminating frauds in several types of application networks. Remarkably, the hybrid method performs well in practice, i.e., guarantees termination, while achieving high effectiveness at the same time.

international conference on management of data | 2010

Consistent query answers in inconsistent probabilistic databases

Xiang Lian; Lei Chen; Shaoxu Song

Efficient and effective manipulation of probabilistic data has become increasingly important recently due to many real applications that involve the data uncertainty. This is especially crucial when probabilistic data collected from different sources disagree with each other and incur inconsistencies. In order to accommodate such inconsistencies and enable consistent query answering (CQA), in this paper, we propose the all-possible-repair semantics in the context of inconsistent probabilistic databases, which formalize the repairs on the database as repair worlds via a graph representation. In turn, the CQA problem can be converted into one in the so-called repaired possible worlds (w.r.t. both repair worlds and possible worlds). We investigate a series of consistent queries in inconsistent probabilistic databases, including consistent range queries, join, and top-k queries, which, however, need to deal with an exponential number of the repaired possible worlds at high cost. To tackle the efficiency problem of CQA, in this paper, we propose efficient approaches for retrieving consistent query answers, including effective pruning methods to filter out false positives. Extensive experiments have been conducted to demonstrate the efficiency and effectiveness of our approaches.

very large data bases | 2016

Semantic SPARQL similarity search over RDF knowledge graphs

Weiguo Zheng; Lei Zou; Wei Peng; Xifeng Yan; Shaoxu Song; Dongyan Zhao

RDF knowledge graphs have attracted increasing attentions these years. However, due to the schema-free nature of RDF data, it is very difficult for users to have full knowledge of the underlying schema. Furthermore, the same kind of information can be represented in diverse graph fragments. Hence, it is a huge challenge to formulate complex SPARQL expressions by taking the union of all possible structures. In this paper, we propose an effective framework to access the RDF repository even if users have no full knowledge of the underlying schema. Specifically, given a SPARQL query, the system could return as more answers that match the query based on the semantic similarity as possible. Interestingly, we propose a systematic method to mine diverse semantically equivalent structure patterns. More importantly, incorporating both structural and semantic similarities we are the first to propose a novel similarity measure, semantic graph edit distance. In order to improve the efficiency performance, we apply the semantic summary graph to summarize the knowledge graph, which supports both high-level pruning and drill-down pruning. We also devise an effective lower bound based on the TA-style access to each of the candidate sets. Extensive experiments over real datasets confirm the effectiveness and efficiency of our approach.

international conference on data engineering | 2011

On data dependencies in dataspaces

Shaoxu Song; Lei Chen; Philip S. Yu

To study data dependencies over heterogeneous data in dataspaces, we define a general dependency form, namely comparable dependencies (CDs), which specifies constraints on comparable attributes. It covers the semantics of a broad class of dependencies in databases, including functional dependencies (FDs), metric functional dependencies (MFDs), and matching dependencies (MDs). As we illustrated, comparable dependencies are useful in real practice of dataspaces, e.g., semantic query optimization. Due to the heterogeneous data in dataspaces, the first question, known as the validation problem, is to determine whether a dependency (almost) holds in a data instance. Unfortunately, as we proved, the validation problem with certain error or confidence guarantee is generally hard. In fact, the confidence validation problem is also NP-hard to approximate to within any constant factor. Nevertheless, we develop several approaches for efficient approximation computation, including greedy and randomized approaches with an approximation bound on the maximum number of violations that an object may introduce. Finally, through an extensive experimental evaluation on real data, we verify the superiority of our methods.

international conference on data engineering | 2015

Cleaning structured event logs: A graph repair approach

Jianmin Wang; Shaoxu Song; Xuemin Lin; Xiaochen Zhu; Jian Pei

Event data are often dirty owing to various recording conventions or simply system errors. These errors may cause many serious damages to real applications, such as inaccurate provenance answers, poor profiling results or concealing interesting patterns from event data. Cleaning dirty event data is strongly demanded. While existing event data cleaning techniques view event logs as sequences, structural information do exist among events. We argue that such structural information enhances not only the accuracy of repairing inconsistent events but also the computation efficiency. It is notable that both the structure and the names (labeling) of events could be inconsistent. In real applications, while unsound structure is not repaired automatically (which needs manual effort from business actors to handle the structure error), it is highly desirable to repair the inconsistent event names introduced by recording mistakes. In this paper, we propose a graph repair approach for 1) detecting unsound structure, and 2) repairing inconsistent event name.

international conference on data engineering | 2014

Matching heterogeneous events with patterns

Xiaochen Zhu; Shaoxu Song; Jianmin Wang; Philip S. Yu; Jia-Guang Sun

A large amount of heterogeneous event data are increasingly generated, e.g., in online systems for Web services or operational systems in enterprises. Owing to the difference between event data and traditional relational data, the matching of heterogeneous events is highly non-trivial. While event names are often opaque (e.g., merely with obscure IDs), the existing structure-based matching techniques for relational data also fail to perform owing to the poor discriminative power of dependency relationships between events. We note that interesting patterns exist in the occurrence of events, which may serve as discriminative features in event matching. In this paper, we formalize the problem of matching events with patterns. A generic pattern based matching framework is proposed, which is compatible with the existing structure based techniques. To improve the matching efficiency, we devise several bounds of matching scores for pruning. Since the exploration of patterns is costly and incrementally, our proposed techniques support matching in a pay-as-you-go style, i.e., incrementally update the matching results with the increase of available patterns. Finally, extensive experiments on both real and synthetic data demonstrate the effectiveness of our pattern based matching compared with approaches adapted from existing techniques, and the efficiency improved by the bounding/pruning methods.

international conference on management of data | 2015

SCREEN: Stream Data Cleaning under Speed Constraints

Shaoxu Song; Aoqian Zhang; Jianmin Wang; Philip S. Yu

Stream data are often dirty, for example, owing to unreliable sensor reading, or erroneous extraction of stock prices. Most stream data cleaning approaches employ a smoothing filter, which may seriously alter the data without preserving the original information. We argue that the cleaning should avoid changing those originally correct/clean data, a.k.a. the minimum change principle in data cleaning. To capture the knowledge about what is clean, we consider the (widely existing) constraints on the speed of data changes, such as fuel consumption per hour, or daily limit of stock prices. Guided by these semantic constraints, in this paper, we propose SCREEN, the first constraint-based approach for cleaning stream data. It is notable that existing data repair techniques clean (a sequence of) data as a whole and fail to support stream computation. To this end, we have to relax the global optimum over the entire sequence to the local optimum in a window. Rather than the commonly observed NP-hardness of general data repairing problems, our major contributions include: (1) polynomial time algorithm for global optimum, (2) linear time algorithm towards local optimum under an efficient Median Principle,(3) support on out-of-order arrivals of data points, and(4) adaptive window size for balancing repair accuracy and efficiency. Experiments on real datasets demonstrate that SCREEN can show significantly higher repair accuracy than the existing approaches such as smoothing.

Explore More