Shuai Ma | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Shuai Ma is active.

Explore More

Publication

Featured researches published by Shuai Ma.

very large data bases | 2010

Graph pattern matching: from intractable to polynomial time

Wenfei Fan; Shuai Ma; Nan Tang; Yinghui Wu; Yunpeng Wu

Graph pattern matching is typically defined in terms of subgraph isomorphism, which makes it an np-complete problem. Moreover, it requires bijective functions, which are often too restrictive to characterize patterns in emerging applications. We propose a class of graph patterns, in which an edge denotes the connectivity in a data graph within a predefined number of hops. In addition, we define matching based on a notion of bounded simulation, an extension of graph simulation. We show that with this revision, graph pattern matching can be performed in cubic-time, by providing such an algorithm. We also develop algorithms for incrementally finding matches when data graphs are updated, with performance guarantees for dag patterns. We experimentally verify that these algorithms scale well, and that the revised notion of graph pattern matching allows us to identify communities commonly found in real-world networks.

very large data bases | 2009

Reasoning about record matching rules

Wenfei Fan; Xibei Jia; Shuai Ma

To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.

very large data bases | 2010

Towards certain fixes with editing rules and master data

Wenfei Fan; Shuai Ma; Nan Tang; Wenyuan Yu

A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.

international conference on data engineering | 2011

Adding regular expressions to graph reachability and pattern queries

Wenfei Fan; Shuai Ma; Nan Tang; Yinghui Wu

It is increasingly common to find graphs in which edges bear different types, indicating a variety of relationships. For such graphs we propose a class of reachability queries and a class of graph patterns, in which an edge is specified with a regular expression of a certain form, expressing the connectivity in a data graph via edges of various types. In addition, we define graph pattern matching based on a revised notion of graph simulation. On graphs in emerging applications such as social networks, we show that these queries are capable of finding more sensible information than their traditional counterparts. Better still, their increased expressive power does not come with extra complexity. Indeed, (1) we investigate their containment and minimization problems, and show that these fundamental problems are in quadratic time for reachability queries and are in cubic time for pattern queries. (2) We develop an algorithm for answering reachability queries, in quadratic time as for their traditional counterpart. (3) We provide two cubic-time algorithms for evaluating graph pattern queries based on extended graph simulation, as opposed to the NP-completeness of graph pattern matching via subgraph isomorphism. (4) The effectiveness, efficiency and scalability of these algorithms are experimentally verified using real-life data and synthetic data.

very large data bases | 2010

Graph homomorphism revisited for graph matching

Wenfei Fan; Shuai Ma; Hongzhi Wang; Yinghui Wu

In a variety of emerging applications one needs to decide whether a graph G matches another Gp, i.e., whether G has a topological structure similar to that of Gp. The traditional notions of graph homomorphism and isomorphism often fall short of capturing the structural similarity in these applications. This paper studies revisions of these notions, providing a full treatment from complexity to algorithms. (1) We propose p-homomorphism (p-hom) and 1-1 p-hom, which extend graph homomorphism and subgraph isomorphism, respectively, by mapping edges from one graph to paths in another, and by measuring the similarity of nodes. (2) We introduce metrics to measure graph similarity, and several optimization problems for p-hom and 1-1 p-hom. (3) We show that the decision problems for p-hom and 1-1 p-hom are NP-complete even for DAGs, and that the optimization problems are approximation-hard. (4) Nevertheless, we provide approximation algorithms with provable guarantees on match quality. We experimentally verify the effectiveness of the revised notions and the efficiency of our algorithms in Web site matching, using real-life and synthetic data.

very large data bases | 2011

Capturing topology in graph pattern matching

Shuai Ma; Yang Cao; Wenfei Fan; Jinpeng Huai; Tianyu Wo

Graph pattern matching is often defined in terms of subgraph isomorphism, an np-complete problem. To lower its complexity, various extensions of graph simulation have been considered instead. These extensions allow pattern matching to be conducted in cubic-time. However, they fall short of capturing the topology of data graphs, i.e., graphs may have a structure drastically different from pattern graphs they match, and the matches found are often too large to understand and analyze. To rectify these problems, this paper proposes a notion of strong simulation, a revision of graph simulation, for graph pattern matching. (1) We identify a set of criteria for preserving the topology of graphs matched. We show that strong simulation preserves the topology of data graphs and finds a bounded number of matches. (2) We show that strong simulation retains the same complexity as earlier extensions of simulation, by providing a cubic-time algorithm for computing strong simulation. (3) We present the locality property of strong simulation, which allows us to effectively conduct pattern matching on distributed graphs. (4) We experimentally verify the effectiveness and efficiency of these algorithms, using real-life data and synthetic data.

very large data bases | 2011

Dynamic constraints for record matching

Wenfei Fan; Hong Gao; Xibei Jia; Shuai Ma

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (mds) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, mds are developed for record matching, and are defined in terms of similarity predicates and a dynamic semantics. (b) We identify a special case of mds, referred to as relative candidate keys (rcks), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring mds, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. Moreover, we develop a sound and complete system for inferring mds. (d) We provide a quadratic-time algorithm for inferring mds and an effective algorithm for deducing a set of high-quality rcks from mds. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing and in addition, that the md-based techniques effectively improve the quality and efficiency of various record matching methods.

international conference on data engineering | 2008

Increasing the Expressivity of Conditional Functional Dependencies without Extra Complexity

Loreto Bravo; Wenfei Fan; Floris Geerts; Shuai Ma

The paper proposes an extension of CFDs [1], referred to as extended Conditional Functional Dependencies (eCFDs). In contrast to CFDs, eCFDs specify patterns of semantically related values in terms of disjunction and inequality, and are capable of catching inconsistencies that arise in practice but cannot be detected by CFDs. The increase in expressive power does not incur extra complexity: we show that the satisfiability and implication analyses of eCFDs remain NP - complete and coNP -complete, respectively, the same as their CFDs counterparts. In light of the intractability, we present an algorithm that approximates the maximum number of eCFDs that are satisfiable. In addition, we revise SQL techniques for detecting CFD violations, and show that violations of multiple eCFDs can be captured via a single pair of SQL queries. We also introduce an incremental SQL technique for detecting eCFD violations in response to database updates. We experimentally verify the effectiveness and efficiency of our SQL -based detection methods.

acm/ieee international conference on mobile computing and networking | 2014

Enhancing reliability to boost the throughput over screen-camera links

Anran Wang; Shuai Ma; Chunming Hu; Jinpeng Huai; Chunyi Peng; Guobin Shen

With the rapid proliferation of camera-equipped smart devices (e.g., smartphones, pads, tablets), visible light communication (VLC) over screen-camera links emerges as a novel form of near-field communication. Such communication via smart devices is highly competitive for its user-friendliness, security, and infrastructure-less (i.e., no dependency on WiFi or cellular infrastructure). However, existing approaches mostly focus on improving the transmission speed and ignore the transmission reliability. Considering the interplay between the transmission speed and reliability towards effective end-to-end communication, in this paper, we aim to boost the throughput over screen-camera links by enhancing the transmission reliability. To this end, we propose RDCode, a robust dynamic barcode which enables a novel packet-frame-block structure. Based on the layered structure, we design different error correction schemes at three levels: intra-blocks, inter-blocks and inter-frames, in order to verify and recover the lost blocks and frames. Finally, we implement RDCode and experimentally show that RDCode reaches a high level of transmission reliability (e.g., reducing the error rate to 10%) and yields a at least doubled transmission rate, compared with the existing state-of-the-art approach COBRA.

international world wide web conferences | 2012

Distributed graph pattern matching

Shuai Ma; Yang Cao; Jinpeng Huai; Tianyu Wo

Graph simulation has been adopted for pattern matching to reduce the complexity and capture the need of novel applications. With the rapid development of the Web and social networks, data is typically distributed over multiple machines. Hence a natural question raised is how to evaluate graph simulation on distributed data. To our knowledge, no such distributed algorithms are in place yet. This paper settles this question by providing evaluation algorithms and optimizations for graph simulation in a distributed setting. (1) We study the impacts of components and data locality on the evaluation of graph simulation. (2) We give an analysis of a large class of distributed algorithms, captured by a message-passing model, for graph simulation. We also identify three complexity measures: visit times, makespan and data shipment, for analyzing the distributed algorithms, and show that these measures are essentially controversial with each other. (3) We propose distributed algorithms and optimization techniques that exploit the properties of graph simulation and the analyses of distributed algorithms. (4) We experimentally verify the effectiveness and efficiency of these algorithms, using both real-life and synthetic data.

Explore More