Guang Xiang | Researchain

Publication

Featured research published by Guang Xiang.

ACM Transactions on Information and System Security | 2011

CANTINA+: A Feature-Rich Machine Learning Framework for Detecting Phishing Web Sites

Guang Xiang; Jason I. Hong; Carolyn Penstein Rosé; Lorrie Faith Cranor

Phishing is a plague in cyberspace. Typically, phish detection methods either use human-verified URL blacklists or exploit Web page features via machine learning techniques. However, the former is frail in terms of new phish, and the latter suffers from the scarcity of effective features and the high false positive rate (FP). To alleviate those problems, we propose a layered anti-phishing solution that aims at (1) exploiting the expressiveness of a rich set of features with machine learning to achieve a high true positive rate (TP) on novel phish, and (2) limiting the FP to a low level via filtering algorithms. Specifically, we proposed CANTINA+, the most comprehensive feature-based approach in the literature including eight novel features, which exploits the HTML Document Object Model (DOM), search engines and third party services with machine learning techniques to detect phish. Moreover, we designed two filters to help reduce FP and achieve runtime speedup. The first is a near-duplicate phish detector that uses hashing to catch highly similar phish. The second is a login form filter, which directly classifies Web pages with no identified login form as legitimate. We extensively evaluated CANTINA+ with two methods on a diverse spectrum of corpora with 8118 phish and 4883 legitimate Web pages. In the randomized evaluation, CANTINA+ achieved over 92% TP on unique testing phish and over 99% TP on near-duplicate testing phish, and about 0.4% FP with 10% training phish. In the time-based evaluation, CANTINA+ also achieved over 92% TP on unique testing phish, over 99% TP on near-duplicate testing phish, and about 1.4% FP under 20% training phish with a two-week sliding window. Capable of achieving 0.4% FP and over 92% TP, our CANTINA+ has been demonstrated to be a competitive anti-phishing solution.
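The layered design described in this abstract (a hash-based near-duplicate detector, then a login form filter, then a machine learning classifier) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the CANTINA+ implementation: the whitespace normalization, the SHA-1 fingerprint, the password-field heuristic, and the names `page_fingerprint`, `has_login_form`, and `classify` are all hypothetical, and the classifier is passed in as an opaque callable.

```python
import hashlib
import re


def page_fingerprint(html: str) -> str:
    """Hash of lowercased, whitespace-collapsed HTML (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", html).strip().lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def has_login_form(html: str) -> bool:
    """Crude, hypothetical heuristic: a <form> tag plus a password input."""
    lower = html.lower()
    has_password = re.search(r"<input[^>]*type\s*=\s*[\"']?password", lower)
    return "<form" in lower and has_password is not None


def classify(html: str, known_phish_hashes: set, ml_classifier) -> str:
    """Apply the two filters first, then fall back to the classifier."""
    # Filter 1: near-duplicate of a known phish, caught by hash lookup.
    if page_fingerprint(html) in known_phish_hashes:
        return "phish"
    # Filter 2: pages with no identified login form are labeled legitimate.
    if not has_login_form(html):
        return "legitimate"
    # Remaining pages go to the feature-based classifier (assumed callable
    # that returns True when it predicts phish).
    return "phish" if ml_classifier(html) else "legitimate"
```

Both filters short-circuit before the classifier runs, which is one plausible reading of how they could reduce FP and save runtime at the same time.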

international world wide web conferences | 2009

A hybrid phish detection approach by identity discovery and keywords retrieval

Guang Xiang; Jason I. Hong

Phishing is a significant security threat to the Internet, which causes tremendous economic loss every year. In this paper, we proposed a novel hybrid phish detection method based on information extraction (IE) and information retrieval (IR) techniques. The identity-based component of our method detects phishing webpages by directly discovering the inconsistency between their identity and the identity they are imitating. The keywords-retrieval component utilizes IR algorithms exploiting the power of search engines to identify phish. Our method requires no training data, no prior knowledge of phishing signatures and specific implementations, and thus is able to adapt quickly to constantly appearing new phishing patterns. Comprehensive experiments over a diverse spectrum of data sources with 11449 pages show that both components have a low false positive rate and the stacked approach achieves a true positive rate of 90.06% with a false positive rate of 1.95%.

conference on information and knowledge management | 2012

Detecting offensive tweets via topical feature discovery over a large scale twitter corpus

Guang Xiang; Bin Fan; Ling Wang; Jason I. Hong; Carolyn Penstein Rosé

In this paper, we propose a novel semi-supervised approach for detecting profanity-related offensive content in Twitter. Our approach exploits linguistic regularities in profane language via statistical topic modeling on a huge Twitter corpus, and detects offensive tweets using these automatically generated features. Our approach performs competitively with a variety of machine learning (ML) algorithms. For instance, our approach achieves a true positive rate (TP) of 75.1% over 4029 testing tweets using Logistic Regression, significantly outperforming the popular keyword matching baseline, which has a TP of 69.7%, while keeping the false positive rate (FP) at the same level as the baseline at about 3.77%. Our approach provides an alternative to large scale hand annotation efforts required by fully supervised learning approaches.

acm multimedia | 2012

Leveraging high-level and low-level features for multimedia event detection

Lu Jiang; Alexander G. Hauptmann; Guang Xiang

This paper addresses the challenge of Multimedia Event Detection by proposing a novel method for fusing high-level and low-level features based on collective classification. Generally, the method consists of three steps: training a classifier from low-level features; encoding high-level features into graphs; and diffusing the scores on the established graph to obtain the final prediction. The final prediction is derived from multiple graphs, each of which corresponds to a high-level feature. The paper investigates two graph construction methods, using logarithmic and exponential loss functions respectively, and two collective classification algorithms, i.e., Gibbs sampling and Markov random walk. The theoretical analysis demonstrates that the proposed method converges and is computationally scalable, and the empirical analysis on the TRECVID 2011 Multimedia Event Detection dataset validates its outstanding performance compared to state-of-the-art methods, with an added benefit of interpretability.
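The score diffusion step can be pictured with a generic damped random walk over a similarity graph. The sketch below is not the paper's algorithm: the update rule, the damping factor `alpha`, the fixed iteration count, and the name `diffuse_scores` are illustrative assumptions. It only shows how initial classifier scores might be smoothed along the edges of one graph.

```python
def diffuse_scores(scores, weights, alpha=0.5, iterations=50):
    """Smooth initial scores over a weighted graph (illustrative only).

    scores  -- initial per-node scores, e.g. from a low-level classifier
    weights -- square matrix of non-negative edge weights
    alpha   -- assumed damping factor: share of each score taken from neighbors
    """
    n = len(scores)
    # Row-normalize the weights into transition probabilities.
    transition = []
    for i, row in enumerate(weights):
        total = sum(row)
        if total:
            transition.append([w / total for w in row])
        else:
            # An isolated node gets a self-loop, so its score is unchanged.
            transition.append([1.0 if j == i else 0.0 for j in range(n)])
    current = list(scores)
    for _ in range(iterations):
        current = [
            (1 - alpha) * scores[i]
            + alpha * sum(transition[i][j] * current[j] for j in range(n))
            for i in range(n)
        ]
    return current
```

With `alpha` below 1 the update is a contraction, so the iteration settles to a fixed point. Combining the outputs of several graphs, one per high-level feature, is left out of this sketch.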

Proceedings of the international workshop on TRECVID video summarization | 2007

Clever clustering vs. simple speed-up for summarizing rushes

Alexander G. Hauptmann; Michael G. Christel; Wei-Hao Lin; Bryan S. Maher; Jun Yang; Robert V. Baron; Guang Xiang

This paper discusses in detail our approaches for producing the submitted summaries to TRECVID, including the two baseline methods. The cluster method performed well in terms of coverage, and adequately in terms of user satisfaction, but did take longer to review. We conducted additional evaluations using the same TRECVID assessment interface to judge 2 additional methods for summary generation: 25x (simple speed-up by 25 times), and pz (emphasizing pans and zooms). Human assessors show significant differences between the cluster, pz, and 25x approaches. The best coverage (text inclusion performance) is obtained by 25x, but at the expense of taking the most time to evaluate and perceived as the most redundant. Method pz was easier to use than cluster and had better performance on pan/zoom recall tasks, leading into discussions on how summaries can be improved with more knowledge of the anticipated users and tasks.

symposium on usable privacy and security | 2011

Smartening the crowds: computational techniques for improving human verification to fight phishing scams

Gang Liu; Guang Xiang; Bryan A. Pendleton; Jason I. Hong; Wenyin Liu

Phishing is an ongoing kind of semantic attack that tricks victims into inadvertently sharing sensitive information. In this paper, we explore novel techniques for combating the phishing problem using computational techniques to improve human effort. Using tasks posted to the Amazon Mechanical Turk human effort market, we measure the accuracy of minimally trained humans in identifying potential phish, and consider methods for best taking advantage of individual contributions. Furthermore, we present our experiments using clustering techniques and vote weighting to improve the results of human effort in fighting phishing. We found that these techniques could increase coverage over and were significantly faster than existing blacklists used today.

european symposium on research in computer security | 2010

A hierarchical adaptive probabilistic approach for zero hour phish detection

Guang Xiang; Bryan A. Pendleton; Jason I. Hong; Carolyn Penstein Rosé

Phishing attacks are a significant threat to users of the Internet, causing tremendous economic loss every year. In combating phish, industry relies heavily on manual verification to achieve a low false positive rate, which, however, tends to be slow in responding to the huge volume of unique phishing URLs created by toolkits. Our goal here is to combine the best aspects of human verified blacklists and heuristic-based methods, i.e., the low false positive rate of the former and the broad and fast coverage of the latter. To this end, we present the design and evaluation of a hierarchical blacklist-enhanced phish detection framework. The key insight behind our detection algorithm is to leverage existing human-verified blacklists and apply the shingling technique, a popular near-duplicate detection algorithm used by search engines, to detect phish in a probabilistic fashion with very high accuracy. To achieve an extremely low false positive rate, we use a filtering module in our layered system, harnessing the power of search engines via information retrieval techniques to correct false positives. Comprehensive experiments over a diverse spectrum of data sources show that our method achieves 0% false positive rate (FP) with a true positive rate (TP) of 67.15% using search-oriented filtering, and 0.03% FP and 73.53% TP without the filtering module. With incremental model building capability via a sliding window mechanism, our approach is able to adapt quickly to new phishing variants, and is thus more responsive to the evolving attacks.
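The shingling technique mentioned in this abstract compares documents by their sets of overlapping word n-grams. A minimal sketch follows, assuming word-level shingles of length `k` and the standard set-overlap (Jaccard) resemblance. The function names are hypothetical, and the paper's actual tokenization, shingle size, and decision rule are not given here.

```python
def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word shingles (assumed word-level tokenization)."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts yield a single shingle, or none if the text is empty.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a: str, b: str, k: int = 3) -> float:
    """Jaccard overlap of the two shingle sets, between 0.0 and 1.0."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def best_blacklist_match(page_text: str, blacklist_texts, k: int = 3) -> float:
    """Highest resemblance between a page and any blacklisted page text."""
    return max((resemblance(page_text, t, k) for t in blacklist_texts), default=0.0)
```

A page whose best match exceeds some chosen threshold would be flagged as a likely phish. That threshold, like the search-engine filtering stage that corrects false positives, is outside this sketch.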

ubiquitous computing | 2010