Chaofeng Sha | Researchain

Publication

Featured research published by Chaofeng Sha.

conference on information and knowledge management | 2003

Dynamically maintaining frequent items over a data stream

Cheqing Jin; Weining Qian; Chaofeng Sha; Jeffrey Xu Yu; Aoying Zhou

It is a challenge to maintain frequent items over a data stream with a small, bounded memory in a dynamic environment where both insertion and deletion of items are allowed. In this paper, we propose a novel algorithm, called hCount, which can handle both insertion and deletion of items with much less memory than the best reported algorithm. Our algorithm is also superior in terms of precision, recall and processing time. In addition, our approach does not require prior knowledge of the size of the value range of a data stream, and can handle range extension dynamically. With a small modification, hCount can be improved to hCount*, which achieves significantly better performance.
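
The idea of tracking item frequencies under both insertions and deletions in bounded memory can be illustrated with a small hash-based counter table. The sketch below is a generic Count-Min style structure, not the paper's hCount; the class name `CountingSketch`, the `width` and `depth` parameters, and the SHA-256 based hashing are all assumptions made for illustration.

```python
import hashlib


class CountingSketch:
    """Generic hash-based counter table (illustrative, not the paper's hCount)."""

    def __init__(self, width=256, depth=4):
        self.width = width    # counters per row (hypothetical default)
        self.depth = depth    # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # Derive one hash function per row by salting the item with the row id.
        digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def update(self, item, delta=1):
        # delta=+1 models an insertion, delta=-1 a deletion.
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += delta

    def estimate(self, item):
        # Collisions can only inflate a counter when true counts are
        # non-negative, so the minimum over rows is the tightest estimate.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )


sketch = CountingSketch()
for item in ["a"] * 5 + ["b"] * 3:
    sketch.update(item)       # eight insertions
sketch.update("a", -2)        # two deletions of "a"; its true count is now 3
estimate_a = sketch.estimate("a")
```

As long as no item is deleted more often than it was inserted, the estimate never falls below the true count, and memory stays fixed at `width * depth` counters regardless of stream length.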

international conference on data engineering | 2007

Distributed Data Stream Clustering: A Fast EM-based Approach

Aoying Zhou; Feng Cao; Ying Yan; Chaofeng Sha; Xiaofeng He

Clustering data streams has attracted considerable research effort recently. However, the problem has not received enough attention when the data streams are generated in a distributed fashion, although such a scenario is very common in real-life applications. Several factors constrain clustering of data streams in a distributed environment: the data records generated are noisy or incomplete due to the unreliable distributed system; the system needs to process a huge volume of data online; and communication is potentially a bottleneck of the system. All these factors pose great challenges for clustering distributed data streams. In this paper, we propose an EM-based (Expectation Maximization) framework to effectively cluster distributed data streams, with the above fundamental challenges in mind. In the presence of noisy or incomplete data records, our algorithms learn the distribution of the underlying data streams by maximizing the likelihood of the data clusters. A test-and-cluster strategy is proposed to reduce the average processing cost, which is especially effective for online clustering over large data streams. Our extensive experimental studies show that the proposed algorithms can achieve high accuracy with less communication cost, memory consumption and CPU time.
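
For readers unfamiliar with EM, the likelihood-maximization step mentioned above can be sketched on a single node with a two-component, one-dimensional Gaussian mixture. This is textbook EM only: it does not reproduce the paper's distributed framework or its test-and-cluster strategy, and the function name `em_two_gaussians`, the initialization, and the synthetic data are assumptions.

```python
import math
import random


def em_two_gaussians(data, iterations=30):
    """Textbook EM for a 1-D mixture of two Gaussians (illustrative only)."""
    mu = [min(data), max(data)]   # crude initialization at the extremes
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [
                weight[k]
                * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = sum(dens) or 1e-300   # guard against underflow to zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters from the weighted points.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(spread, 1e-6)    # floor keeps the variance positive
            weight[k] = nk / len(data)
    return mu, var, weight


random.seed(0)
# Synthetic stream snapshot: two well-separated clusters around 0 and 10.
points = [random.gauss(0.0, 1.0) for _ in range(200)]
points += [random.gauss(10.0, 1.0) for _ in range(200)]
means, variances, weights = em_two_gaussians(points)
```

Because each iteration only needs per-component weighted sums, the sufficient statistics are small compared with the raw records, which is one reason model-based summaries are attractive when communication is a bottleneck.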

web age information management | 2010

XML structural similarity search using mapreduce

Peisen Yuan; Chaofeng Sha; Xiaoling Wang; Bin Yang; Aoying Zhou; Su Yang

XML is a de facto standard for web data exchange and information representation. Efficient management of large volumes of XML data poses challenges to conventional techniques. To cope with large-scale data, the MapReduce computing framework has recently attracted more and more attention in the database community as an efficient solution. In this paper, an efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce. First, sub-structures of the XML structure are extracted in parallel from a large XML corpus located on a large cluster. Then Min-Hashing and locality sensitive hashing techniques are developed on the distributed and parallel computing framework for efficient structural similarity search processing. An empirical study on the cluster with large real datasets demonstrates the effectiveness and efficiency of our approach.
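
The Min-Hashing and locality sensitive hashing step can be sketched on one machine as follows. This is a minimal illustration, not the paper's MapReduce implementation: representing a document as a set of path strings, the function names, and the choice of 64 hash functions split into 16 bands are all assumptions.

```python
import hashlib


def _hash(seed, token):
    # One salted hash per seed stands in for a family of hash functions.
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=64):
    """Signature = minimum hash value of the token set under each hash function."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    # The fraction of matching positions estimates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_band_keys(signature, bands=16):
    # Split the signature into bands; two documents become candidates
    # for comparison if they agree on every row of at least one band.
    rows = len(signature) // bands
    return {(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)}


# Hypothetical sub-structures (root-to-node paths) of two XML documents.
doc_a = {"book/title", "book/author", "book/year"}
doc_b = {"book/title", "book/author", "book/price"}
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
similarity = estimated_jaccard(sig_a, sig_b)   # true Jaccard is 2/4 = 0.5
are_candidates = bool(lsh_band_keys(sig_a) & lsh_band_keys(sig_b))
```

In a parallel setting, the band keys are a natural grouping key: documents that emit the same key land in the same group and only those pairs need an exact comparison.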

database systems for advanced applications | 2011

Privacy preserving query processing on secret share based data storage

Xiuxia Tian; Chaofeng Sha; Xiaoling Wang; Aoying Zhou

Database as a Service (DaaS) is a paradigm for data management in which the Database Service Provider (DSP), usually a professional third party for data management, hosts the database as a service. Many security and query problems arise in this context because the DSP may be untrusted or malicious. Most existing work concentrates on using symmetric encryption to guarantee the confidentiality of the delegated data, and on using a partition-based index to help execute privacy-preserving range queries. However, encryption and decryption operations on large volumes of data are time consuming, and query results often contain many irrelevant data tuples. In contrast to encryption-based schemes, in this paper we present a secret-share-based scheme to guarantee the confidentiality of delegated data. More importantly, we construct a privacy-preserving index to accelerate queries and to help return exactly the required data tuples. Finally, we analyze the security properties and demonstrate the efficiency and query response time of our approach through empirical data.
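
To make the notion of secret sharing concrete, here is a minimal sketch of Shamir's (k, n) threshold scheme over a prime field. The abstract does not say which sharing scheme the paper uses, so the choice of Shamir's scheme, the modulus `PRIME`, and the function names `split` and `reconstruct` are assumptions for illustration.

```python
import secrets

PRIME = 2**61 - 1  # a Mersenne prime; the field size is an illustrative choice


def split(secret, k, n):
    """Split `secret` into n shares such that any k of them reconstruct it."""
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret % PRIME] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):     # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, evaluate(x)) for x in range(1, n + 1)]


def reconstruct(shares):
    """Recover the secret by Lagrange interpolation at x = 0."""
    total = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total


salary = 52000                       # hypothetical attribute value to protect
shares = split(salary, k=3, n=5)     # e.g. one share per storage provider
recovered = reconstruct(shares[:3])  # any three shares suffice
```

Unlike encryption, no key is needed to read the data back: the client simply collects enough shares, while any group holding fewer than k shares learns nothing about the value.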

web age information management | 2010

Semi-supervised learning from only positive and unlabeled data using entropy

Xiaoling Wang; Zhen Xu; Chaofeng Sha; Martin Ester; Aoying Zhou

The problem of classification from positive and unlabeled examples currently attracts much attention. However, when the number of unlabeled negative examples is very small, the effectiveness of previous work decreases. This paper proposes an effective approach to address this problem: we first use entropy to select the likely positive and negative examples to build a complete training set, and then apply a logistic regression classifier to this new training set for classification. A series of experiments is conducted. The experimental results illustrate that the proposed approach outperforms previous work in the literature.

asia-pacific web conference | 2013

Selecting a Diversified Set of Reviews

Wenzhe Yu; Rong Zhang; Xiaofeng He; Chaofeng Sha

Online product reviews provide helpful information for user decision-making. However, since user-generated reviews have proliferated in recent years, it is critical to deal with information overload on e-commerce sites. In this paper, we propose an approach to select a small set of representative reviews for each product, which considers both attribute coverage and opinion diversity under the requirement of providing high-quality reviews. First, we assign a weight to each attribute, which measures the attribute's importance and helps select useful reviews; second, we cluster reviews into different groups representing different concerns, which leads to better diversification results, especially when selecting smaller sets of reviews; finally, we perform a set of experiments on real datasets to verify our ideas.

asia-pacific web conference | 2013

Practical Duplicate Bug Reports Detection in a Large Web-Based Development Community

Liang Feng; Leyi Song; Chaofeng Sha; Xueqing Gong

Most large web-based development communities require a bug tracking system to keep track of various bug reports. However, duplicate bug reports tend to waste resources and may cause conflicts. Two types of work have focused on this problem: relevant bug report retrieval [8][11][10][13] and duplicate bug report identification [5][12]. The former methods can achieve high accuracy (82%) in the top 10 results on some datasets, but they do not really reduce the workload of developers. The latter methods still need further improvement in performance.

international world wide web conferences | 2012

Exploiting shopping and reviewing behavior to re-score online evaluations

Rong Zhang; Chaofeng Sha; Minqi Zhou; Aoying Zhou

Analysis of product reviews has attracted great attention from both academia and industry. Generally, the evaluation scores of reviews are used to generate the average scores of products and shops for future potential users. However, in the real world, evaluation scores are often inconsistent with review content, and some customers do not give fair reviews. In this work, we focus on detecting the credibility of customers by analyzing online shopping and review behaviors, and then we re-score the reviews for products and shops. Finally, we evaluate our algorithm on a real data set from Taobao, the biggest e-commerce site in China.

asia pacific web conference | 2006

Approximate top-k structural similarity search over XML documents

Tao Xie; Chaofeng Sha; Xiaoling Wang; Aoying Zhou

With the development of XML applications, such as digital libraries, XML publish/subscribe systems, and other XML repositories, top-k structural similarity search over XML documents is attracting more attention. In previous work, the similarity of two XML documents is measured using the edit distance defined between XML trees. Since the computation of edit distances is time consuming, some recent work has used structural summaries to compute edit distance and improve algorithm performance. However, most existing algorithms for calculating edit distance between trees ignore the fact that nodes in a tree may be of different significance, and inappropriately assume the same edit operation costs for all nodes in an XML document tree. This paper addresses this problem by proposing a summary structure that makes the tree-based edit distance more rational; furthermore, a novel weighting scheme is proposed to indicate that some nodes are more important than others with respect to structural similarity. We introduce a new cost model for computing structural distance that takes node weight information into account in the distance computation. Compared with former techniques, our approach can efficiently answer top-k queries approximately. We verify this approach through a series of experiments, and the results show that using weighted structural summaries for top-k queries is efficient and practical.

Frontiers of Computer Science in China | 2015

Product-oriented review summarization and scoring

Rong Zhang; Wenzhe Yu; Chaofeng Sha; Xiaofeng He; Aoying Zhou

Currently, there are many online review web sites where consumers can freely write comments about different kinds of products and services. These comments are quite useful for other potential consumers. However, the number of online comments is often large, and it continues to grow as more and more consumers contribute. In addition, one comment may mention more than one product and contain opinions about different products, mentioning something good and something bad. However, they share only a single overall score. Therefore, it is not easy to know the quality of an individual product from these comments. This paper presents a novel approach to generate review summaries including scores and description snippets with respect to each individual product. From the large number of comments, we first extract the context (snippet) that includes a description of the products and choose those snippets that express consumer opinions on them. We then propose several methods to predict the rating (from 1 to 5 stars) of the snippets. Finally, we derive a generic framework for generating summaries from the snippets. We design a new snippet selection algorithm, based on a standard seat allocation algorithm, to ensure that the returned results preserve the opinion-aspect statistical properties and attribute-aspect coverage. Through experiments we demonstrate empirically that our methods are effective. We also quantitatively evaluate each step of our approach.
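
Seat allocation can be illustrated with the largest remainder (Hamilton) method, which distributes a fixed number of slots in proportion to group sizes. The abstract does not name the specific allocation algorithm, so the choice of this method, the function name `largest_remainder`, and the opinion counts below are assumptions.

```python
def largest_remainder(counts, seats):
    """Allocate `seats` slots proportionally to `counts` (Hamilton method)."""
    total = sum(counts.values())
    # Integer arithmetic avoids floating-point ties: each group first gets
    # the whole part of its quota, counts[g] * seats / total.
    allocation = {g: (c * seats) // total for g, c in counts.items()}
    remainders = {g: (c * seats) % total for g, c in counts.items()}
    leftover = seats - sum(allocation.values())
    # Remaining slots go to the groups with the largest fractional parts.
    ranked = sorted(counts, key=lambda g: remainders[g], reverse=True)
    for g in ranked[:leftover]:
        allocation[g] += 1
    return allocation


# Hypothetical opinion distribution over 100 snippets, summarized in 5 slots.
snippet_counts = {"positive": 60, "neutral": 25, "negative": 15}
slots = largest_remainder(snippet_counts, seats=5)
```

With these counts the quotas are 3.0, 1.25 and 0.75, so the single leftover slot goes to the negative group and the summary keeps at least one snippet from each opinion class.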
