Chaofeng Sha
Fudan University
Publication
Featured research published by Chaofeng Sha.
conference on information and knowledge management | 2003
Cheqing Jin; Weining Qian; Chaofeng Sha; Jeffrey Xu Yu; Aoying Zhou
It is a challenge to maintain frequent items over a data stream with a small, bounded memory in a dynamic environment where both insertion and deletion of items are allowed. In this paper, we propose a novel algorithm, called hCount, which can handle both insertion and deletion of items with much less memory than the best reported algorithm. Our algorithm is also superior in terms of precision, recall and processing time. In addition, our approach does not require prior knowledge of the size of the value range of a data stream, and can handle range extension dynamically. With a small modification, hCount can be improved to hCount*, which achieves significantly better performance.
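As a rough illustration of how a bounded table of counters can track item frequencies under both insertions and deletions, the sketch below uses a generic count-min-style structure. It is not the authors' hCount implementation; the class name `CounterSketch`, the parameters `width` and `depth`, and the hash family are assumptions made for this example, and items are assumed to be non-negative integers.

```python
import random


class CounterSketch:
    """Generic count-min-style sketch (illustrative, not the paper's hCount)."""

    def __init__(self, width=64, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.prime = 2_147_483_647  # 2**31 - 1, used by the hash family
        # One (a, b) pair per row defines h(x) = ((a*x + b) mod prime) mod width.
        self.params = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(depth)
        ]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, (a, b) in enumerate(self.params):
            yield row, ((a * item + b) % self.prime) % self.width

    def update(self, item, delta=1):
        """Insert (delta > 0) or delete (delta < 0) occurrences of an item."""
        for row, col in self._cells(item):
            self.table[row][col] += delta

    def estimate(self, item):
        """Return the smallest counter the item maps to.

        While all true counts are non-negative, this never underestimates.
        """
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CounterSketch()
sketch.update(7, 5)    # five insertions of item 7
sketch.update(7, -2)   # two deletions of item 7
print(sketch.estimate(7))
```

Memory use is fixed at `width * depth` counters regardless of stream length, which is the property the abstract emphasizes; the accuracy guarantees of hCount itself are given in the paper, not by this sketch.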
international conference on data engineering | 2007
Aoying Zhou; Feng Cao; Ying Yan; Chaofeng Sha; Xiaofeng He
Clustering data streams has attracted considerable research effort recently. However, the problem has not received enough attention for data streams generated in a distributed fashion, even though this scenario is very common in real-life applications. Several factors constrain clustering of data streams in a distributed environment: the generated data records are noisy or incomplete because the distributed system is unreliable; the system needs to process a huge volume of data online; and communication is a potential bottleneck of the system. All these factors pose great challenges for clustering distributed data streams. In this paper, we propose an EM-based (Expectation Maximization) framework to effectively cluster distributed data streams, with the above fundamental challenges in mind. In the presence of noisy or incomplete data records, our algorithms learn the distribution of the underlying data streams by maximizing the likelihood of the data clusters. A test-and-cluster strategy is proposed to reduce the average processing cost, which is especially effective for online clustering over large data streams. Our extensive experimental studies show that the proposed algorithms can achieve high accuracy with less communication cost, memory consumption and CPU time.
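To make the EM idea concrete, here is a minimal batch sketch that fits a two-component one-dimensional Gaussian mixture. It only illustrates the alternating E-step and M-step; the distributed setting, the handling of incomplete records, and the test-and-cluster strategy from the abstract are not modeled. The function name `em_two_gaussians`, the initialization at the data extremes, and the variance floor `min_var` are assumptions of this example.

```python
import math


def em_two_gaussians(data, iterations=50, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture with plain batch EM."""
    mu = [min(data), max(data)]   # assumed initialization: the data extremes
    var = [1.0, 1.0]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [
                weight[k]
                * math.exp(-((x - mu[k]) ** 2) / (2 * var[k]))
                / math.sqrt(2 * math.pi * var[k])
                for k in range(2)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate parameters to increase the likelihood.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            spread = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            var[k] = max(min_var, spread)   # floor avoids a collapsed component
            weight[k] = nk / len(data)
    return mu, var, weight


points = [0.0, 0.1, -0.1, 0.2, -0.2, 5.0, 5.1, 4.9, 5.2, 4.8]
means, variances, weights = em_two_gaussians(points)
print(means, weights)
```

On this toy input the two estimated means settle near the centers of the two groups of points.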
web age information management | 2010
Peisen Yuan; Chaofeng Sha; Xiaoling Wang; Bin Yang; Aoying Zhou; Su Yang
XML is a de facto standard for web data exchange and information representation. Efficient management of large volumes of XML data poses challenges to conventional techniques. To cope with large-scale data, the MapReduce computing framework has recently attracted more and more attention in the database community as an efficient solution. In this paper, an efficient and scalable framework is proposed for XML structural similarity search on a large cluster with MapReduce. First, sub-structures of XML structures are extracted in parallel from a large XML corpus located on a large cluster. Then Min-Hashing and locality-sensitive hashing techniques are developed on the distributed and parallel computing framework for efficient structural similarity search processing. An empirical study on the cluster with large real datasets demonstrates the effectiveness and efficiency of our approach.
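The combination of Min-Hashing and locality-sensitive hashing can be sketched on a single machine as follows. This is a generic textbook version, not the paper's MapReduce pipeline: representing each document as a set of path strings, the salted SHA-1 hash family, and the names `minhash_signature` and `lsh_candidates` are all assumptions of this example. Each token set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict


def _hash(seed, token):
    """Deterministic 64-bit hash of a token, salted by the seed."""
    digest = hashlib.sha1(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=32):
    """One minimum per hash function; equal positions estimate Jaccard similarity."""
    return [min(_hash(seed, t) for t in tokens) for seed in range(num_hashes)]


def lsh_candidates(signatures, bands=8):
    """Return pairs of document ids that share at least one identical band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i in range(len(ordered)):
            for j in range(i + 1, len(ordered)):
                pairs.add((ordered[i], ordered[j]))
    return pairs


# Hypothetical sub-structures: root-to-leaf paths of three small documents.
docs = {
    "a": {"/book/title", "/book/author", "/book/price"},
    "b": {"/book/title", "/book/author", "/book/price"},
    "c": {"/cd/artist", "/cd/track"},
}
sigs = {doc_id: minhash_signature(paths) for doc_id, paths in docs.items()}
print(lsh_candidates(sigs))
```

Only candidate pairs produced by the banding step need an exact similarity check, which is what makes the approach attractive for large corpora.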
database systems for advanced applications | 2011
Xiuxia Tian; Chaofeng Sha; Xiaoling Wang; Aoying Zhou
Database as a Service (DaaS) is a paradigm for data management in which the Database Service Provider (DSP), usually a professional third party for data management, hosts the database as a service. Because the DSP may be untrusted or malicious, many security and query problems arise in this context. Most existing work concentrates on using symmetric encryption to guarantee the confidentiality of the delegated data, and on using a partition-based index to help execute privacy-preserving range queries. However, encryption and decryption operations on large volumes of data are time consuming, and query results always contain many irrelevant data tuples. In contrast to encryption-based schemes, in this paper we present a secret-sharing-based scheme to guarantee the confidentiality of delegated data. More importantly, we construct a privacy-preserving index to accelerate queries and to help return exactly the required data tuples. Finally, we analyze the security properties and demonstrate the efficiency and query response time of our approach through empirical data.
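The abstract does not say which secret sharing construction is used. As one well-known possibility, the sketch below shows Shamir's (k, n) threshold scheme over a prime field: any `threshold` shares recover the value, while fewer reveal nothing about it. The function names, the choice of prime, and the use of Python's `random` module are assumptions of this example; a real deployment would draw coefficients from a cryptographically secure source such as the `secrets` module.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime; the secret must be smaller than this


def split_secret(secret, n_shares, threshold, rng):
    """Evaluate a random degree-(threshold-1) polynomial with f(0) = secret."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = 0
        for c in reversed(coeffs):      # Horner's rule
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


shares = split_secret(123456, n_shares=5, threshold=3, rng=random.Random(1))
print(recover_secret(shares[:3]))
```

Unlike encryption, recovery here needs only a few modular multiplications per value, which fits the abstract's concern about the cost of encrypting and decrypting large volumes of data.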
web age information management | 2010
Xiaoling Wang; Zhen Xu; Chaofeng Sha; Martin Ester; Aoying Zhou
The problem of classification from positive and unlabeled examples currently attracts much attention. However, when the number of unlabeled negative examples is very small, the effectiveness of previous work decreases. This paper proposes an effective approach to address this problem: we first use entropy to select the likely positive and negative examples to build a complete training set, and then apply a logistic regression classifier to this new training set for classification. A series of experiments is conducted. The experimental results illustrate that the proposed approach outperforms previous work in the literature.
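The abstract does not spell out how entropy drives the selection. One plausible reading, shown here purely as an assumption, is to keep the unlabeled examples whose estimated class probability has low binary entropy, so that only confident positives and negatives enter the training set. The scores, the threshold `max_entropy`, and the function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Entropy in bits of a Bernoulli distribution with parameter p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def select_confident(scores, max_entropy=0.5):
    """Split low-entropy examples into likely positives and likely negatives.

    `scores` maps an example id to an estimated probability of being positive
    (how that estimate is obtained is outside this sketch).
    """
    positives, negatives = [], []
    for example_id, p in scores.items():
        if binary_entropy(p) <= max_entropy:
            (positives if p >= 0.5 else negatives).append(example_id)
    return sorted(positives), sorted(negatives)


# Hypothetical probability estimates for four unlabeled examples.
scores = {"u1": 0.97, "u2": 0.55, "u3": 0.04, "u4": 0.5}
print(select_confident(scores))
```

The selected examples could then be combined with the labeled positives to train a standard logistic regression classifier, as the abstract describes.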
asia-pacific web conference | 2013
Wenzhe Yu; Rong Zhang; Xiaofeng He; Chaofeng Sha
Online product reviews provide helpful information for user decision-making. However, since user-generated reviews have proliferated in recent years, it is critical to deal with information overload on e-commerce sites. In this paper, we propose an approach to select a small set of representative reviews for each product, which considers both attribute coverage and opinion diversity under the requirement of providing high-quality reviews. First, we assign a weight to each attribute, which measures the attribute's importance and helps realize useful review selection; second, we cluster reviews into different groups representing different concerns, which leads to better diversification results, especially when selecting smaller sets of reviews; finally, we perform a set of experiments on real datasets to verify our ideas.
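The attribute-weighting step can be illustrated with a greedy weighted-coverage selection. This sketch covers only attribute coverage; the clustering and opinion-diversity parts of the approach are not modeled, and the greedy rule, the function name `select_reviews`, and the sample weights are assumptions of this example rather than the paper's algorithm.

```python
def select_reviews(reviews, weights, k):
    """Greedily pick up to k reviews that add the most uncovered attribute weight.

    `reviews` maps a review id to the set of attributes it mentions;
    `weights` maps an attribute to its importance.
    """
    chosen, covered = [], set()
    for _ in range(k):
        best_id, best_gain = None, 0.0
        for review_id in sorted(reviews):          # sorted for deterministic ties
            if review_id in chosen:
                continue
            new_attrs = reviews[review_id] - covered
            gain = sum(weights.get(attr, 0.0) for attr in new_attrs)
            if gain > best_gain:
                best_id, best_gain = review_id, gain
        if best_id is None:                        # nothing adds new coverage
            break
        chosen.append(best_id)
        covered |= reviews[best_id]
    return chosen


# Hypothetical reviews and attribute weights.
reviews = {
    "r1": {"battery", "screen"},
    "r2": {"screen"},
    "r3": {"price", "camera"},
}
weights = {"battery": 0.4, "screen": 0.3, "price": 0.2, "camera": 0.1}
print(select_reviews(reviews, weights, k=2))
```

Here `r2` is never picked because its only attribute is already covered by `r1`, which shows how weighting plus coverage avoids redundant reviews.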
asia-pacific web conference | 2013
Liang Feng; Leyi Song; Chaofeng Sha; Xueqing Gong
Most large web-based development communities require a bug tracking system to keep track of various bug reports. However, duplicate bug reports tend to waste resources and may cause conflicts. Two types of work have focused on this problem: relevant bug report retrieval [8][11][10][13] and duplicate bug report identification [5][12]. The former methods can achieve high accuracy (82%) in the top 10 results on some datasets, but they do not really reduce the workload of developers. The latter methods still need further improvement in performance.
international world wide web conferences | 2012
Rong Zhang; Chaofeng Sha; Minqi Zhou; Aoying Zhou
Analysis of product reviews has attracted great attention from both academia and industry. Generally, the evaluation scores of reviews are used to generate average scores of products and shops for future potential users. However, in the real world, there is an inconsistency problem between the evaluation scores and the review content, and some customers do not give fair reviews. In this work, we focus on detecting the credibility of customers by analyzing their online shopping and review behaviors, and then we re-score the reviews for products and shops. Finally, we evaluate our algorithm on a real data set from Taobao, the biggest e-commerce site in China.
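How credibility feeds into re-scoring is not detailed in the abstract. A simple way to picture it, offered only as an assumption, is a credibility-weighted average in which each customer's score counts in proportion to an estimated credibility value. The function name `rescore` and the credibility values below are hypothetical.

```python
def rescore(reviews, credibility):
    """Credibility-weighted mean of review scores.

    `reviews` is a list of (customer_id, score) pairs;
    `credibility` maps a customer id to a non-negative weight.
    """
    total_weight = sum(credibility[customer] for customer, _ in reviews)
    if total_weight == 0:
        raise ValueError("all reviewers have zero credibility")
    weighted_sum = sum(credibility[customer] * score for customer, score in reviews)
    return weighted_sum / total_weight


# Two low-credibility customers give 5 stars, one credible customer gives 1.
reviews = [("u1", 5), ("u2", 5), ("u3", 1)]
credibility = {"u1": 0.25, "u2": 0.25, "u3": 1.0}
print(rescore(reviews, credibility))
```

The plain average of these scores is about 3.67, while the weighted score is pulled toward the credible customer's rating.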
asia pacific web conference | 2006
Tao Xie; Chaofeng Sha; Xiaoling Wang; Aoying Zhou
With the development of XML applications, such as digital libraries, XML publish/subscribe systems, and other XML repositories, top-k structural similarity search over XML documents is attracting more attention. In previous work, the similarity of two XML documents is measured using the edit distance defined between XML trees. Since computing edit distances is time consuming, some recent work has proposed calculating edit distance over structural summaries to improve algorithm performance. However, most existing algorithms for calculating edit distance between trees ignore the fact that nodes in a tree may differ in significance, and inappropriately assume the same edit operation costs for all nodes in an XML document tree. This paper addresses this problem by proposing a summary structure that makes the tree-based edit distance more rational; furthermore, a novel weighting scheme is proposed to indicate that some nodes are more important than others with respect to structural similarity. We introduce a new cost model for computing structural distance that takes node weights into account in the distance computation. Compared with former techniques, our approach can answer top-k queries approximately and efficiently. We verify this approach through a series of experiments, and the results show that using weighted structural summaries for top-k queries is efficient and practical.
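The effect of node weights on edit costs can be shown with a deliberately simplified stand-in: a weighted edit distance over preorder sequences of (label, depth) pairs instead of a full tree edit distance over structural summaries. The depth-based weight `decay ** depth`, the relabel cost, and the function names are assumptions of this example, not the cost model defined in the paper.

```python
def node_weight(depth, decay=0.5):
    """Assumed weighting: nodes closer to the root matter more."""
    return decay ** depth


def weighted_edit_distance(seq_a, seq_b, decay=0.5):
    """Edit distance over (label, depth) sequences with weight-dependent costs."""
    n, m = len(seq_a), len(seq_b)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + node_weight(seq_a[i - 1][1], decay)
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + node_weight(seq_b[j - 1][1], decay)
    for i in range(1, n + 1):
        label_a, depth_a = seq_a[i - 1]
        cost_a = node_weight(depth_a, decay)
        for j in range(1, m + 1):
            label_b, depth_b = seq_b[j - 1]
            cost_b = node_weight(depth_b, decay)
            same = (label_a, depth_a) == (label_b, depth_b)
            relabel = 0.0 if same else max(cost_a, cost_b)
            dist[i][j] = min(
                dist[i - 1][j] + cost_a,        # delete a node of seq_a
                dist[i][j - 1] + cost_b,        # insert a node of seq_b
                dist[i - 1][j - 1] + relabel,   # match or relabel
            )
    return dist[n][m]


# Hypothetical preorder summaries of three small documents.
query = [("book", 0), ("title", 1), ("author", 1)]
leaf_differs = [("book", 0), ("title", 1), ("price", 1)]
root_differs = [("cd", 0), ("title", 1), ("author", 1)]
print(weighted_edit_distance(query, leaf_differs))
print(weighted_edit_distance(query, root_differs))
```

With uniform costs both candidates would be one edit away from the query; with weights, a difference at the root costs more than a difference at a leaf, so the first candidate would rank higher in a top-k result.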
Frontiers of Computer Science in China | 2015
Rong Zhang; Wenzhe Yu; Chaofeng Sha; Xiaofeng He; Aoying Zhou
Currently, there are many online review web sites where consumers can freely write comments about different kinds of products and services. These comments are quite useful for other potential consumers. However, the number of online comments is often large and continues to grow as more and more consumers contribute. In addition, one comment may mention more than one product and contain opinions about different products, mentioning something good and something bad. However, these opinions share only a single overall score. Therefore, it is not easy to know the quality of an individual product from these comments. This paper presents a novel approach to generate review summaries including scores and description snippets with respect to each individual product. From the large number of comments, we first extract the context (snippet) that includes a description of the products and choose those snippets that express consumer opinions on them. We then propose several methods to predict the rating (from 1 to 5 stars) of the snippets. Finally, we derive a generic framework for generating summaries from the snippets. We design a new snippet selection algorithm, based on a standard seat allocation algorithm, to ensure that the returned results preserve the opinion-aspect statistical properties and attribute-aspect coverage. Through experiments we demonstrate empirically that our methods are effective. We also quantitatively evaluate each step of our approach.
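The abstract does not name the seat allocation algorithm. As one standard example, the largest remainder (Hamilton) method below distributes a fixed number of summary slots across opinion categories in proportion to how often each occurs. The choice of method, the function name `allocate_slots`, and the sample counts are assumptions of this sketch.

```python
def allocate_slots(counts, total_slots):
    """Largest remainder method: proportional slots that sum to total_slots."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("counts must contain at least one snippet")
    quotas = {key: total_slots * count / total for key, count in counts.items()}
    slots = {key: int(quota) for key, quota in quotas.items()}   # floor of each quota
    leftover = total_slots - sum(slots.values())
    # Hand the remaining slots to the largest fractional remainders
    # (ties broken alphabetically to keep the result deterministic).
    by_remainder = sorted(counts, key=lambda key: (-(quotas[key] - slots[key]), key))
    for key in by_remainder[:leftover]:
        slots[key] += 1
    return slots


# Hypothetical opinion counts among the extracted snippets for one product.
counts = {"positive": 60, "neutral": 25, "negative": 15}
print(allocate_slots(counts, total_slots=5))
```

Allocating slots this way keeps the share of each opinion in the summary close to its share in the full set of snippets, which is the statistical property the abstract aims to preserve.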