Sihong Xie | Researchain

Publication

Featured research published by Sihong Xie.

ACM Transactions on Intelligent Systems and Technology | 2012

Identify Online Store Review Spammers via Social Review Graph

Guan Wang; Sihong Xie; Bing Liu; Philip S. Yu

Online shopping reviews provide valuable information for customers to compare the quality of products, store services, and many other aspects of future purchases. However, spammers are joining this community and trying to mislead consumers by writing fake or unfair reviews. Previous attempts have used reviewers’ behaviors, such as text similarity and rating patterns, to detect spammers. These studies are able to identify certain types of spammers, for instance, those who post many similar reviews about one target. However, in reality, there are other kinds of spammers who can manipulate their behaviors to act just like normal reviewers, and thus cannot be detected by the available techniques. In this article, we propose a novel concept of review graph to capture the relationships among all reviewers, reviews, and stores that the reviewers have reviewed as a heterogeneous graph. We explore how interactions between nodes in this graph could reveal the cause of spam and propose an iterative computation model to identify suspicious reviewers. In the review graph, we have three kinds of nodes, namely, reviewer, review, and store. We capture their relationships by introducing three fundamental concepts, the trustiness of reviewers, the honesty of reviews, and the reliability of stores, and identifying their interrelationships: a reviewer is more trustworthy if the person has written more honest reviews; a store is more reliable if it has more positive reviews from trustworthy reviewers; and a review is more honest if many other honest reviews support it. This is the first time such intricate relationships have been identified for spam detection and captured in a graph model. We further develop an effective computation method based on the proposed graph model. Unlike existing approaches, we do not use any review text information. Our model is thus complementary to existing approaches and able to find more difficult and subtle spamming activities, which are agreed upon by human judges after they evaluate our results.
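The three interrelationships described in this abstract can be illustrated with a small iterative computation. The sketch below is only an illustration under stated assumptions, not the paper's actual equations: the function name `score_review_graph`, the +1/-1 rating encoding, the use of `tanh` to keep scores in (-1, 1), and weighting support by the trust of the other reviewers are all hypothetical choices made for this example.

```python
import math


def score_review_graph(reviews, iterations=20):
    """Illustrative mutual-reinforcement scoring (not the paper's equations).

    reviews: list of (reviewer, store, rating) tuples, rating is +1 or -1.
    Returns (trust, honesty, reliability); all scores lie in (-1, 1).
    """
    trust = {reviewer: 1.0 for reviewer, _, _ in reviews}
    honesty = [0.0] * len(reviews)
    reliability = {store: 0.0 for _, store, _ in reviews}

    for _ in range(iterations):
        # A review is more honest when other reviews of the same store,
        # weighted by their authors' trust (an assumption), agree with it.
        for i, (_, store, rating) in enumerate(reviews):
            support = 0.0
            for j, (other, other_store, other_rating) in enumerate(reviews):
                if j != i and other_store == store:
                    agreement = 1.0 if other_rating == rating else -1.0
                    support += trust[other] * agreement
            honesty[i] = math.tanh(support)

        # A reviewer is more trustworthy with more honest reviews.
        for reviewer in trust:
            total = sum(h for h, (r, _, _) in zip(honesty, reviews) if r == reviewer)
            trust[reviewer] = math.tanh(total)

        # A store is more reliable with positive reviews from trusted reviewers.
        for store in reliability:
            total = sum(trust[r] * rating for r, s, rating in reviews if s == store)
            reliability[store] = math.tanh(total)

    return trust, honesty, reliability


# Three reviewers praise store "S"; reviewer "d" is the lone dissenter.
reviews = [("a", "S", 1), ("b", "S", 1), ("c", "S", 1), ("d", "S", -1)]
trust, honesty, reliability = score_review_graph(reviews)
```

In this toy input the dissenting reviewer ends up with negative trust while the store is scored as reliable. No review text is used, which matches the abstract's emphasis on graph structure alone.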

web search and data mining | 2014

Inferring the impacts of social media on crowdfunding

Chun Ta Lu; Sihong Xie; Xiangnan Kong; Philip S. Yu

Crowdfunding, in which people raise funds through collaborative contributions of the general public (i.e., the crowd), has emerged as a billion-dollar business supporting more than one million ventures. However, very few research works have examined the process of crowdfunding. In particular, none has studied how social networks help crowdfunding projects to succeed. To gain insights into the effects of social networks in crowdfunding, we analyze the hidden connections between the fundraising results of projects on crowdfunding websites and the corresponding promotion campaigns in social media. Our analysis considers the dynamics of crowdfunding from two aspects: how fundraising activities and promotional activities on social media simultaneously evolve over time, and how the promotion campaigns influence the final outcomes. From our investigation, we identify a number of important principles that provide a useful guide for devising effective campaigns. For example, we observe the temporal distribution of customer interest, strong correlations between a crowdfunding project's early promotional activities and the final outcomes, and the importance of concurrent promotion from multiple sources. We then show that these discoveries can help predict several important quantities, including overall popularity and the success rate of the project. Finally, we show how to use these discoveries to help design crowdfunding sites.

siam international conference on data mining | 2014

Future Influence Ranking of Scientific Literature

Senzhang Wang; Sihong Xie; Xiaoming Zhang; Zhoujun Li; Philip S. Yu; Xinyu Shu

Researchers or students entering an emerging research area are particularly interested in which newly published papers will be most cited and which young researchers will become influential in the future, so that they can catch the most recent advances and find valuable research directions. However, predicting the future importance of scientific articles and authors is challenging due to the dynamic nature of literature networks and evolving research topics. Different from most previous studies aiming to rank the current importance of literature and authors, we focus on ranking the future popularity of new publications and young researchers by proposing a unified ranking model to combine various available information. Specifically, we first propose to extract two kinds of text features, words and word co-occurrences, to characterize innovative papers and authors. Then, instead of using static and unweighted graphs, we construct time-aware weighted graphs to distinguish the varying importance of links established at different times. Finally, by leveraging both the constructed text features and graphs, we propose a mutual reinforcement ranking framework called MRFRank to rank the future importance of papers and authors simultaneously. Experimental results on the ArnetMiner dataset show that the proposed approach significantly outperforms the baselines on the metric recommendation intensity.
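To make the idea of a time-aware weighted graph concrete, the sketch below gives recent citation links more weight than old ones. The abstract does not state the weighting function, so the exponential decay, the `decay` parameter, and the function names `time_aware_weights` and `weighted_in_score` are assumptions for illustration only; this is not the MRFRank model itself.

```python
import math


def time_aware_weights(citations, current_year, decay=0.5):
    """Weight each citation edge by its age (exponential decay is an assumption).

    citations: list of (citing_paper, cited_paper, year) tuples.
    Returns {(citing_paper, cited_paper): weight}.
    """
    return {
        (src, dst): math.exp(-decay * (current_year - year))
        for src, dst, year in citations
    }


def weighted_in_score(weights):
    """Sum incoming edge weights per cited paper."""
    scores = {}
    for (_, dst), weight in weights.items():
        scores[dst] = scores.get(dst, 0.0) + weight
    return scores


# Both papers have two citations, but p2's citations are more recent.
citations = [
    ("p3", "p1", 2010),
    ("p4", "p1", 2011),
    ("p5", "p2", 2014),
    ("p6", "p2", 2014),
]
weights = time_aware_weights(citations, current_year=2014)
scores = weighted_in_score(weights)
```

An unweighted graph would score `p1` and `p2` equally; with time-aware weights, `p2` ranks higher because its links were established more recently.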

siam international conference on data mining | 2016

Identifying connectivity patterns for brain diseases via multi-side-view guided deep architectures

Jingyuan Zhang; Bokai Cao; Sihong Xie; Chun Ta Lu; Philip S. Yu; Ann B. Ragin

There is considerable interest in mining neuroimage data to discover clinically meaningful connectivity patterns to inform an understanding of neurological and neuropsychiatric disorders. Subgraph mining models have been used to discover connected subgraph patterns. However, it is difficult to capture the complicated interplay among patterns. As a result, classification performance based on these results may not be satisfactory. To address this issue, we propose to learn non-linear representations of brain connectivity patterns from deep learning architectures. This is non-trivial, due to the limited subjects and the high costs of acquiring the data. Fortunately, auxiliary information from multiple side views such as clinical, serologic, immunologic, cognitive and other diagnostic testing also characterizes the states of subjects from different perspectives. In this paper, we present a novel Multi-side-View guided AutoEncoder (MVAE) that incorporates multiple side views into the process of deep learning to tackle the bias in the construction of connectivity patterns caused by the scarce clinical data. Extensive experiments show that MVAE not only captures discriminative connectivity patterns for classification, but also discovers meaningful information for clinical interpretation.

international world wide web conferences | 2016

HeteroSales: Utilizing Heterogeneous Social Networks to Identify the Next Enterprise Customer

Qingbo Hu; Sihong Xie; Jiawei Zhang; Qiang Zhu; Songtao Guo; Philip S. Yu

Nowadays, a modern e-commerce company may have both online sales and offline sales departments. Normally, online sales attempt to sell in small quantities to individual customers by broadcasting a large number of emails or promotion codes, which rely heavily on the designed backend algorithms. Offline sales, on the other hand, try to sell in much larger quantities to enterprise customers through contacts initiated by sales representatives, which are more costly compared to online sales. Unlike many previous research works focusing on machine learning algorithms to support online sales, this paper introduces an approach that utilizes heterogeneous social networks to improve the effectiveness of offline sales. More specifically, we propose a two-phase framework, HeteroSales, which first constructs a company-to-company graph, a.k.a. Company Homophily Graph (CHG), from semantics-based meta-path learning, and then adopts label propagation on the graph to predict promising companies that we may successfully close an offline deal with. Based on statistical analysis of the world's largest professional social network, LinkedIn, we demonstrate interesting discoveries showing that not all the social connections in a heterogeneous social network are useful in this task. In other words, proper data preprocessing is essential to ensure the effectiveness of offline sales. Finally, through experiments on LinkedIn social network data and third-party offline sales records, we demonstrate the power of HeteroSales to identify potential enterprise customers in offline sales.
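The second phase described here, label propagation over a company-to-company graph, can be sketched generically. The example below is a minimal clamped label propagation on a weighted undirected graph, assuming the graph has already been built; the function name `propagate_labels`, the toy company names, and the edge weights are hypothetical and do not come from the paper.

```python
def propagate_labels(edges, seeds, iterations=50):
    """Clamped label propagation on a weighted undirected graph.

    edges: {(u, v): weight} for undirected edges.
    seeds: {node: score} for nodes with known outcomes (1.0 = deal closed,
           0.0 = no deal); seed scores are held fixed in every iteration.
    Returns {node: score}, where a higher score suggests a more promising node.
    """
    neighbors = {}
    for (u, v), weight in edges.items():
        neighbors.setdefault(u, []).append((v, weight))
        neighbors.setdefault(v, []).append((u, weight))

    scores = {node: seeds.get(node, 0.0) for node in neighbors}
    for _ in range(iterations):
        updated = {}
        for node, nbrs in neighbors.items():
            if node in seeds:
                updated[node] = seeds[node]  # keep known labels fixed
            else:
                total = sum(weight for _, weight in nbrs)
                # Each unlabeled node takes the weighted mean of its neighbors.
                updated[node] = sum(weight * scores[m] for m, weight in nbrs) / total
        scores = updated
    return scores


# Toy chain of companies: A (deal closed) - B - C - D (no deal).
edges = {("A", "B"): 1.0, ("B", "C"): 1.0, ("C", "D"): 1.0}
seeds = {"A": 1.0, "D": 0.0}
scores = propagate_labels(edges, seeds)
```

Company `B`, which sits next to the successful customer, converges to a score near 2/3, while `C` converges near 1/3, so `B` would be ranked as the more promising lead in this toy graph.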

knowledge discovery and data mining | 2014

Class-distribution regularized consensus maximization for alleviating overfitting in model combination

Sihong Xie; Jing Gao; Wei Fan; Deepak S. Turaga; Philip S. Yu

In data mining applications such as crowdsourcing and privacy-preserving data mining, one may wish to obtain consolidated predictions out of multiple models without access to features of the data. Moreover, because multiple models usually carry complementary predictive information, model combination can potentially provide more robust and accurate predictions by correcting independent errors from individual models. Various methods have been proposed to combine predictions such that the final predictions are maximally agreed upon by multiple base models. Though this maximum consensus principle has been shown to be successful, simply maximizing consensus can lead to less discriminative predictions and overfit the inevitable noise due to imperfect base models. We argue that proper regularization for model combination approaches is needed to alleviate such overfitting. Specifically, we analyze the hypothesis spaces of several model combination methods and identify the trade-off between model consensus and generalization ability. We propose a novel model called Regularized Consensus Maximization (RCM), which is formulated as an optimization problem to combine the maximum consensus and large margin principles. We theoretically show that RCM has a smaller upper bound on generalization error compared to the version without regularization. Experiments show that the proposed algorithm outperforms a wide spectrum of state-of-the-art model combination methods on 11 tasks.

international conference on data mining | 2008

Graph-Based Iterative Hybrid Feature Selection

Erheng Zhong; Sihong Xie; Wei Fan; Jiangtao Ren; Jing Peng; Kun Zhang

When the number of labeled examples is limited, traditional supervised feature selection techniques often fail due to sample selection bias or the unrepresentative sample problem. To solve this, semi-supervised feature selection techniques exploit the statistical information of both labeled and unlabeled examples at the same time. However, the results of semi-supervised feature selection can at times be unsatisfactory, and the culprit is how to effectively use the unlabeled data. Quite different from both supervised and semi-supervised feature selection, we propose a "hybrid" framework based on graph models. We first apply supervised methods to select a small set of the most critical features from the labeled data. Importantly, these initial features might otherwise be missed when selection is performed on the labeled and unlabeled examples simultaneously. Next, this initial feature set is expanded and corrected with the use of unlabeled data. We formally analyze why the expected performance of the hybrid framework is better than both supervised and semi-supervised feature selection. Experimental results demonstrate that the proposed method outperforms both traditional supervised and state-of-the-art semi-supervised feature selection algorithms by at least 10% in accuracy on a number of text and biomedical problems with thousands of features to choose from. Software and datasets are available from the authors.

conference on information and knowledge management | 2015

Learning Entity Types from Query Logs via Graph-Based Modeling

Jingyuan Zhang; Luo Jie; Altaf Rahman; Sihong Xie; Yi Chang; Philip S. Yu

Entities (e.g., person, movie or place) play an important role in real-world applications, and learning entity types has attracted much attention in recent years. Most conventional automatic techniques use large corpora, such as news articles, to learn types of entities. However, such text corpora focus on general knowledge about entities in an objective way. Hence, it is difficult to satisfy those users with specific and personalized needs for an entity. Recent years have witnessed an explosive expansion in the mining of search query logs, which contain billions of entities. The word patterns and click-throughs in search logs are not found in text corpora, thus providing a complementary source for discovering entity types based on user behaviors. In this paper, we study the problem of learning entity types from search query logs and address the following challenges: (1) queries are short texts, and information related to entities is usually very sparse; (2) large amounts of irrelevant information exist in search logs, bringing noise into detecting entity types. We first model query logs using a bipartite graph with entities and their auxiliary information, such as contextual words and clicked URLs. Then we propose a graph-based framework called ELP (Ensemble framework based on Label Propagation) to simultaneously learn the types of both entities and auxiliary signals. In ELP, two separate strategies are designed to fix the problems of sparsity and noise in query logs. Extensive empirical studies are conducted on real search logs to evaluate the effectiveness of the proposed ELP framework.

Information Sciences | 2015

Constructing plausible innocuous pseudo queries to protect user query intention

Zongda Wu; Jie Shi; Chenglang Lu; Enhong Chen; Guandong Xu; Guiling Li; Sihong Xie; Philip S. Yu

Users of web search engines are increasingly worried that their query activities may expose what topics they are interested in, and in turn, compromise their privacy. It would be desirable for a search engine to protect the true query intention for users without compromising the precision-recall performance. In this paper, we propose a client-based approach to address this problem. The basic idea is to issue plausible but innocuous pseudo queries together with a user query, so as to mask the user intention. First, we present a privacy model which formulates plausibility and innocuousness, and then the requirements which should be satisfied to ensure that the user intention is protected against a search engine effectively. Second, based on a semantic reference space derived from Wikipedia, we propose an approach to construct a group of pseudo queries that exhibit similar characteristic distribution as a given user query, but point to irrelevant topics, so as to meet the security requirements defined by the privacy model. Finally, we conduct extensive experimental evaluations to demonstrate the practicality and effectiveness of our approach.

international conference on big data | 2016

CER: Complementary entity recognition via knowledge expansion on large unlabeled product reviews

Hu Xu; Sihong Xie; Lei Shu; Philip S. Yu

Product reviews contain a lot of useful information about product features and customer opinions. One important product feature is the complementary entity (products) that may potentially work together with the reviewed product. Knowing complementary entities of the reviewed product is very important because customers want to buy compatible products and avoid incompatible ones. In this paper, we address the problem of Complementary Entity Recognition (CER). Since no existing method can solve this problem, we first propose a novel unsupervised method to utilize syntactic dependency paths to recognize complementary entities. Then we expand category-level domain knowledge about complementary entities using only a few general seed verbs on a large amount of unlabeled reviews. The domain knowledge helps the unsupervised method to adapt to different products and greatly improves the precision of the CER task. The advantage of the proposed method is that it does not require any labeled data for training. We conducted experiments on 7 popular products with about 1200 reviews in total to demonstrate that the proposed approach is effective.
