Peng Cai

East China Normal University

Publication

Featured researches published by Peng Cai.

international acm sigir conference on research and development in information retrieval | 2010

Learning to rank only using training data from related domain

Wei Gao; Peng Cai; Kam-Fai Wong; Aoying Zhou

Like traditional supervised and semi-supervised algorithms, learning to rank for information retrieval requires document annotations provided by domain experts. It is costly to annotate training data for different search domains and tasks. We propose to exploit training data annotated for a related domain to learn to rank retrieved documents in the target domain, in which no labeled data is available. We present a simple yet effective approach based on an instance-weighting scheme. Our method first estimates the importance of each related-domain document relative to the target domain. Then heuristics are studied to transform the importance of individual documents into the pairwise weights of document pairs, which can be directly incorporated into popular ranking algorithms. Due to importance weighting, a ranking model trained on the related domain is highly adaptable to the data of the target domain. Ranking adaptation experiments on the LETOR3.0 dataset [27] demonstrate that with a fair amount of related-domain training data, our method significantly outperforms the baseline without weighting, and most of the time is not significantly worse than an ideal model directly trained on the target domain.
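The step from per-document importance to pairwise weights can be illustrated with a minimal sketch. The abstract does not say which heuristics were studied, so the combination rules below (mean, product, min), the function names, and the hinge loss are all illustrative assumptions rather than the authors' method.

```python
from itertools import combinations


def pair_weights(doc_weights, labels, combine="mean"):
    """Turn per-document importance weights into weights for preference pairs.

    The combination rules (mean, product, min) are assumptions made for
    illustration; the paper's actual heuristics may differ.
    """
    rules = {
        "mean": lambda a, b: (a + b) / 2.0,
        "product": lambda a, b: a * b,
        "min": min,
    }
    rule = rules[combine]
    pairs = {}
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            continue  # equally relevant documents express no preference
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        pairs[(hi, lo)] = rule(doc_weights[hi], doc_weights[lo])
    return pairs


def weighted_pairwise_hinge(scores, pairs):
    """Sum of w * max(0, 1 - (s_hi - s_lo)) over all preference pairs."""
    return sum(
        w * max(0.0, 1.0 - (scores[hi] - scores[lo]))
        for (hi, lo), w in pairs.items()
    )


# Hypothetical data: three related-domain documents for one query.
doc_weights = [0.9, 0.2, 0.5]   # estimated importance to the target domain
labels = [2, 0, 1]              # relevance judgments
scores = [1.5, 1.0, 0.2]        # current model scores
pairs = pair_weights(doc_weights, labels, combine="mean")
loss = weighted_pairwise_hinge(scores, pairs)
```

Pairs built from documents that look more like the target domain contribute more to the loss, which is the general idea behind plugging the weights into a pairwise ranker.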

international acm sigir conference on research and development in information retrieval | 2011

Relevant knowledge helps in choosing right teacher: active query selection for ranking adaptation

Peng Cai; Wei Gao; Aoying Zhou; Kam-Fai Wong

Learning to adapt in a new setting is a common challenge to our knowledge and capability. New life would be easier if we actively pursued supervision from the right mentor chosen with our relevant but limited prior knowledge. This variant principle of active learning seems intuitively useful to many domain adaptation problems. In this paper, we substantiate its power for advancing automatic ranking adaptation, which is important in web search since it is prohibitive to gather enough labeled data for every search domain to fully train domain-specific rankers. For cost-effectiveness, it is expected that only the most informative instances in the target domain are collected for annotation while we can still utilize the abundant ranking knowledge in the source domain. We propose a unified ranking framework to mutually reinforce the active selection of informative target-domain queries and the appropriate weighting of source training data as related prior knowledge. We select for annotation those target queries whose document order disagrees most among the members of a committee built on the mixture of source training data and the already selected target data. Then the replenished labeled set is used to adjust the importance of source queries for enhancing their rank transfer. This procedure iterates until the labeling budget is exhausted. Based on the LETOR3.0 and Yahoo! Learning to Rank Challenge data sets, our approach significantly outperforms the random query annotation commonly used in ranking adaptation and the active rank learner on target-domain data only.
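The committee-based selection step can be sketched as follows. The disagreement measure used here (the number of document pairs that two committee members order differently) is an assumption, as are all function names and the toy data; how the committee is built and how source queries are reweighted are not modeled.

```python
from itertools import combinations


def pairwise_disagreement(rank_a, rank_b):
    """Count document pairs that two rankings order differently."""
    pos_a = {doc: i for i, doc in enumerate(rank_a)}
    pos_b = {doc: i for i, doc in enumerate(rank_b)}
    return sum(
        1
        for d1, d2 in combinations(rank_a, 2)
        if (pos_a[d1] - pos_a[d2]) * (pos_b[d1] - pos_b[d2]) < 0
    )


def query_disagreement(committee_rankings):
    """Total disagreement over all pairs of committee members for one query."""
    return sum(
        pairwise_disagreement(a, b)
        for a, b in combinations(committee_rankings, 2)
    )


def select_queries(rankings_by_query, budget):
    """Pick the `budget` queries on which the committee disagrees most."""
    ranked = sorted(
        rankings_by_query,
        key=lambda q: query_disagreement(rankings_by_query[q]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical committee of three rankers, each ranking the documents
# retrieved for three unlabeled target-domain queries.
rankings_by_query = {
    "q1": [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"]],
    "q2": [["a", "b", "c"], ["c", "b", "a"], ["b", "a", "c"]],
    "q3": [["a", "b"], ["a", "b"], ["a", "b"]],
}
selected = select_queries(rankings_by_query, budget=1)
```

In the iterative procedure the abstract describes, the selected queries would be annotated, added to the labeled set, and the committee rebuilt before the next round.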

web age information management | 2010

Semantic entity detection by integrating CRF and SVM

Peng Cai; Hangzai Luo; Aoying Zhou

Semantic entity detection is very important for extracting and representing the abundant semantic information of multimedia documents. In comparison with other media, e.g. video, image and audio, text expresses semantics more directly and often serves as a bridge in cross-media analysis. However, semantic entity detection from text is still a difficult problem because of the complexity of natural language. In this paper, we propose a novel framework that takes advantage of both CRF (conditional random fields) and SVM (support vector machines), and present its application to semantic entity detection. Using this framework, context features are represented as the probability of an entity boundary and extracted via CRF, and then linguistic and statistical features are extracted via large-scale text document analysis. Finally, all extracted features are integrated and used to perform the classification. As our algorithm systematically integrates the context, linguistic and statistical features, it may outperform traditional algorithms that adopt only part of the features.

web age information management | 2016

Low Overhead Log Replication for Main Memory Database System

Jinwei Guo; Chendong Zhang; Peng Cai; Minqi Zhou; Aoying Zhou

Log replication is the key component of a highly available database system. To guarantee data consistency and reliability, modern database systems often use the Paxos protocol to replicate logs across multiple database instance sites. Since the replicated logs need to contain some metadata, such as the committed log sequence number (LSN), this increases storage and network overhead. It has a significantly negative impact on throughput in update-intensive workloads. In this paper, we present an implementation of log replication and database recovery that adopts the idea of piggybacking, i.e. the committed LSN is embedded in the commit logs. This practice not only retains the virtues of Paxos replication, but also reduces disk and network IO effectively, which enhances performance and decreases recovery time. We implemented and evaluated our approach in a main memory database system (Oceanbase), and found that our method can offer 1.3x higher throughput than traditional log replication with a synchronization mechanism.
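The piggybacking idea can be illustrated with a small sketch in which the committed LSN travels inside commit log entries instead of in separate metadata messages. The `LogEntry` and `Follower` classes, their fields, and the apply rule are hypothetical; they are not taken from Oceanbase or from the paper's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogEntry:
    lsn: int
    payload: str
    is_commit: bool = False
    # Piggybacked field (assumption): the highest LSN the leader knew to be
    # committed when it wrote this entry. Only commit entries carry it here.
    committed_lsn: Optional[int] = None


class Follower:
    """Hypothetical follower that learns the commit point from commit logs."""

    def __init__(self) -> None:
        self.log: List[LogEntry] = []
        self.committed_lsn = 0
        self.applied: List[str] = []
        self._next_to_apply = 0  # index into self.log

    def receive(self, entry: LogEntry) -> None:
        self.log.append(entry)
        if entry.is_commit and entry.committed_lsn is not None:
            self.committed_lsn = max(self.committed_lsn, entry.committed_lsn)
            self._apply_committed()

    def _apply_committed(self) -> None:
        # Apply entries in log order up to the known committed LSN.
        while (self._next_to_apply < len(self.log)
               and self.log[self._next_to_apply].lsn <= self.committed_lsn):
            self.applied.append(self.log[self._next_to_apply].payload)
            self._next_to_apply += 1


follower = Follower()
for entry in [
    LogEntry(1, "update x"),
    LogEntry(2, "commit T1", is_commit=True, committed_lsn=0),
    LogEntry(3, "update y"),
    LogEntry(4, "commit T2", is_commit=True, committed_lsn=2),
]:
    follower.receive(entry)
```

In this sketch the follower's view of the commit point lags the leader's, since it only advances when a later commit entry arrives, and no extra message or log record is needed to carry it.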

australasian database conference | 2015

Detecting Spamming Groups in Social Media Based on Latent Graph

Qunyan Zhang; Chi Zhang; Peng Cai; Weining Qian; Aoying Zhou

Spammers in microblogging services aim to disseminate useless or misleading information, which leads to poor user experience and has a negative impact on the ecosystem of the social media platform. Individual spammer detection, based on content and social network information, has been proposed to alleviate this predicament. However, most of the time spamming behavior is conducted collaboratively by a group of users, referred to as a spamming group. In this paper, we propose to detect spamming groups in microblogging services. In the first step, we propose RP-LDA to extract user features and find user groups within which users share similar retweeting behavior. Then, the degree to which each individual user is a spammer is calculated using a semi-supervised label propagation procedure. Finally, we determine the spamming groups using the mixed membership distribution of users. Empirical studies over a real-life dataset demonstrate the effectiveness of our method and show that it can outperform the baseline.
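The semi-supervised label propagation step can be illustrated generically. The sketch below assumes a plain neighbor-averaging scheme on a hypothetical user graph with a few labeled seed users; it does not reproduce RP-LDA, the paper's features, or its exact propagation rule.

```python
def propagate_labels(neighbors, seeds, iterations=50):
    """Spread spammer scores from labeled seeds over a user graph.

    neighbors: dict mapping each user to a list of adjacent users.
    seeds: dict mapping labeled users to a score (1.0 spammer, 0.0 normal).
    Unlabeled users start at 0.5 and repeatedly take the mean score of
    their neighbors, while seed users stay clamped to their labels.
    """
    scores = {user: seeds.get(user, 0.5) for user in neighbors}
    for _ in range(iterations):
        updated = {}
        for user, nbrs in neighbors.items():
            if user in seeds:
                updated[user] = seeds[user]
            elif nbrs:
                updated[user] = sum(scores[n] for n in nbrs) / len(nbrs)
            else:
                updated[user] = scores[user]
        scores = updated
    return scores


# Hypothetical chain of users: known spammer "s", known normal user "n",
# and two unlabeled users in between.
neighbors = {
    "s": ["u1"],
    "u1": ["s", "u2"],
    "u2": ["u1", "n"],
    "n": ["u2"],
}
seeds = {"s": 1.0, "n": 0.0}
scores = propagate_labels(neighbors, seeds)
```

Users closer to known spammers end up with higher scores, which can then be combined with group membership to flag whole groups.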

european conference on information retrieval | 2011

Weight-based boosting model for cross-domain relevance ranking adaptation

Peng Cai; Wei Gao; Kam-Fai Wong; Aoying Zhou

Adaptation techniques based on importance weighting have been shown to be effective for RankSVM and RankNet, viz., each training instance is assigned a target weight denoting its importance to the target domain, which is incorporated into the loss function. In this work, we extend RankBoost using the importance weighting framework for ranking adaptation. We find it non-trivial to incorporate the target weight into boosting-based ranking algorithms because it plays a contradictory role against the innate weight of boosting, namely the source weight that focuses on adjusting source-domain ranking accuracy. Our experiments show that among three variants, the additive weight-based RankBoost, which dynamically balances the two types of weights, significantly and consistently outperforms the baseline trained directly on the source domain.

database systems for advanced applications | 2011

AUCWeb: A prototype for analyzing user-created web data

Weining Qian; Feng Chen; Juan Du; Wei Ming Zhang; Can Zhang; Haixin Ma; Peng Cai; Minqi Zhou; Aoying Zhou

In this demonstration, we present a prototype system, called AUCWeb, that is designed for analyzing user-created web data. It has novel features in that 1) it may utilize external resources for semantic annotation of low-quality user-created content, and 2) it provides a descriptive language for defining analytical tasks. Both the internal mechanism and the usage of AUCWeb for building advanced applications are to be shown in the demonstration.

conference on multimedia modeling | 2010

Semantic entity-relationship model for large-scale multimedia news exploration and recommendation

Hangzai Luo; Peng Cai; Wei Gong; Jianping Fan

Even though current news websites use a large amount of multimedia material, including image, video and audio, this material is used only as a supplement to the traditional text-based framework. As users always prefer multimedia, the traditional text-based news exploration interface receives more and more criticism from both journalists and general audiences. To resolve this problem, we propose a novel framework for multimedia news exploration and analysis. The proposed framework adopts our semantic entity-relationship model to model the multimedia semantics. The proposed semantic entity-relationship model has three nice properties. First, it is able to model multimedia semantics with visual, audio and text properties in a uniform framework. Second, it can be extracted via existing semantic analysis and machine learning algorithms. Third, it is easy to implement sophisticated information mining and visualization algorithms based on the model. Based on this model, we implemented a novel multimedia news exploration and analysis system by integrating visual analytics and information mining techniques. Our system not only provides higher efficiency in news exploration and retrieval but also reveals extra interesting information that is not available in traditional news exploration systems.

social informatics | 2011

Towards high-quality semantic entity detection over online forums

Juan Du; Wei Ming Zhang; Peng Cai; Linling Ma; Weining Qian; Aoying Zhou

User-generated content (UGC) reflects user behavior. Mining such data helps in understanding the relationship between social media and the real world. However, UGC is usually of low quality, which makes semantic entity extraction difficult. In this paper, we propose a method for high-quality semantic entity refinement on forums by employing external resources. Experiments on real-life Chinese online forums show the effectiveness of our method.

database systems for advanced applications | 2018

Efficient Snapshot Isolation in Paxos-Replicated Database Systems

Jinwei Guo; Peng Cai; Bing Xiao; Weining Qian; Aoying Zhou

Modern database systems are increasingly deployed in a cluster of commodity machines with Paxos-based replication to offer better performance, higher availability and fault tolerance. In the widely adopted implementation, one database replica is elected as the leader and is responsible for transaction requests. After transaction execution is completed, the leader generates the transaction log and does not commit the transaction until the log has been replicated to a majority of replicas. The state of the leader is always ahead of that of the follower replicas, since the leader commits transactions first and only notifies the other replicas of the latest committed log entries in later communication. As a follower replica cannot immediately provide the latest snapshot, both read-write and read-only transactions would be executed at the leader to guarantee strong snapshot isolation semantics. In this work, we design and implement an efficient snapshot isolation scheme. This scheme uses adaptive timestamp allocation to avoid frequently requesting the leader to assign transaction timestamps. Furthermore, we design an early log replay mechanism for follower replicas. It allows a follower replica to execute a read operation without waiting for log replay to generate the required snapshot. Compared with the conventional implementation, we experimentally show that the optimized snapshot isolation for Paxos-replicated database systems has better performance in terms of scalability and throughput.
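The conventional commit rule described above, where the leader commits only after a majority of replicas hold the log, can be sketched as follows. The `Leader` class, its method names, and the in-order commit policy are assumptions for illustration; the sketch does not model the paper's adaptive timestamp allocation or early log replay.

```python
class Leader:
    """Hypothetical leader that commits log entries once a majority has them."""

    def __init__(self, cluster_size):
        self.cluster_size = cluster_size
        self.acks = {}          # lsn -> set of replica ids holding the entry
        self.committed_lsn = 0  # highest LSN known to be committed

    def append(self, lsn):
        # The leader's own copy counts toward the majority.
        self.acks[lsn] = {"leader"}

    def on_ack(self, lsn, replica_id):
        self.acks[lsn].add(replica_id)
        self._advance_commit()

    def _advance_commit(self):
        # Commit strictly in LSN order (an assumption of this sketch).
        majority = self.cluster_size // 2 + 1
        next_lsn = self.committed_lsn + 1
        while next_lsn in self.acks and len(self.acks[next_lsn]) >= majority:
            self.committed_lsn = next_lsn
            next_lsn += 1


leader = Leader(cluster_size=3)
leader.append(1)
leader.append(2)

leader.on_ack(2, "follower-1")
committed_after_first_ack = leader.committed_lsn   # entry 1 still lacks a majority

leader.on_ack(1, "follower-1")
committed_after_second_ack = leader.committed_lsn  # entries 1 and 2 both commit
```

Followers learn the new commit point only in a later message, which is why their snapshots lag the leader's in the conventional design.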