Brian D. Davison
Lehigh University
Publications
Featured research published by Brian D. Davison.
Knowledge Discovery and Data Mining | 2010
Liangjie Hong; Brian D. Davison
Social networks such as Facebook, LinkedIn, and Twitter have been a crucial source of information for a wide spectrum of users. In Twitter, popular information that is deemed important by the community propagates through the network. Studying the characteristics of content in the messages becomes important for a number of tasks, such as breaking news detection, personalized message recommendation, friend recommendation, sentiment analysis, and others. While many researchers wish to use standard text mining tools to understand messages on Twitter, the restricted length of those messages prevents them from being employed to their full potential. We address the problem of using standard topic models in micro-blogging environments by studying how the models can be trained on such data. We propose several schemes to train a standard topic model and compare their quality and effectiveness through a set of carefully designed experiments from both qualitative and quantitative perspectives. We show that by training a topic model on aggregated messages we can obtain a higher-quality learned model, which results in significantly better performance in two real-world classification problems. We also discuss how the state-of-the-art Author-Topic model fails to model hierarchical relationships between entities in social media.
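A minimal sketch of the aggregation idea described above, assuming the gensim library; pooling each author's tweets into one pseudo-document is one of the training schemes the paper compares, and the preprocessing here is illustrative only:

```python
# Hypothetical sketch: train LDA on author-aggregated tweets (gensim assumed).
from gensim import corpora, models

def train_aggregated_lda(tweets_by_author, num_topics=50):
    """tweets_by_author: dict mapping author id -> list of tokenized tweets."""
    # Pool each author's tweets into a single pseudo-document before training,
    # rather than treating every short message as its own document.
    docs = [[token for tweet in tweets for token in tweet]
            for tweets in tweets_by_author.values()]
    dictionary = corpora.Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]
    lda = models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary)
    return lda, dictionary
```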
ACM Computing Surveys | 2009
Xiaoguang Qi; Brian D. Davison
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
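As one illustration of the Web-specific features the survey discusses, the hypothetical sketch below (scikit-learn assumed) augments a page's own text with anchor text from its in-links before training an ordinary text classifier; the feature choice and data are illustrative, not a method prescribed by the survey:

```python
# Hypothetical sketch: combine a page's text with in-link anchor text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_documents(pages):
    """pages: list of dicts with 'text', 'inlink_anchor_texts', 'label'."""
    docs = [p["text"] + " " + " ".join(p["inlink_anchor_texts"]) for p in pages]
    labels = [p["label"] for p in pages]
    return docs, labels

# Toy training data, purely for illustration.
pages = [
    {"text": "latest football scores and match reports",
     "inlink_anchor_texts": ["sports news", "live scores"], "label": "sports"},
    {"text": "stock market update and earnings analysis",
     "inlink_anchor_texts": ["finance news", "market watch"], "label": "business"},
]
docs, labels = build_documents(pages)
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(docs, labels)
```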
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2000
Brian D. Davison
Most web pages are linked to others with related content. This idea, combined with another that says that the text in, and possibly around, HTML anchors describes the pages to which they point, is the foundation for a usable World-Wide Web. In this paper, we examine to what extent these ideas hold by empirically testing whether topical locality mirrors spatial locality of pages on the Web. In particular, we find the likelihood of linked pages having similar textual content to be high; the similarity of sibling pages increases when the links from the parent are close together; titles, descriptions, and anchor text represent at least part of the target page; and anchor text may be a useful discriminator among unseen child pages. These results show the foundations necessary for the success of many web systems, including search engines, focused crawlers, linkage analyzers, and intelligent web agents.
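A minimal sketch of the kind of measurement underlying this study: TF-IDF cosine similarity between the text of a source page and a page it links to (scikit-learn assumed; the toy strings stand in for crawled page content):

```python
# Hypothetical sketch: textual similarity of a page and a page it links to.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

source_text = "introduction to information retrieval and web search engines"
linked_text = "web search engines rank documents retrieved for a user query"

tfidf = TfidfVectorizer().fit_transform([source_text, linked_text])
similarity = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
print(f"textual similarity of linked pages: {similarity:.3f}")
```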
International World Wide Web Conferences | 2006
Baoning Wu; Vinay Goel; Brian D. Davison
Web spam is behavior that attempts to deceive search engine ranking algorithms. TrustRank is a recent algorithm that can combat web spam. However, TrustRank is vulnerable in the sense that the seed set used by TrustRank may not be sufficiently representative to cover well the different topics on the Web. Also, for a given seed set, TrustRank has a bias towards larger communities. We propose the use of topical information to partition the seed set and calculate trust scores for each topic separately to address the above issues. A combination of these trust scores for a page is used to determine its ranking. Experimental results on two large datasets show that our Topical TrustRank has a better performance than TrustRank in demoting spam sites or pages. Compared to TrustRank, our best technique can decrease spam from the top ranked sites by as much as 43.1%.
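A hypothetical sketch of topic-wise trust propagation in the spirit of Topical TrustRank, assuming networkx: a biased PageRank is run per topic-partitioned seed set and the per-topic scores are then combined (the simple sum here is only one possible combination, not necessarily the paper's best-performing one):

```python
# Hypothetical sketch: per-topic trust scores via topic-biased PageRank.
import networkx as nx

def topical_trust(graph, seeds_by_topic, alpha=0.85):
    """seeds_by_topic: dict mapping topic -> set of trusted seed nodes."""
    combined = {node: 0.0 for node in graph}
    for topic, seeds in seeds_by_topic.items():
        # Teleport only to this topic's trusted seeds.
        personalization = {n: (1.0 if n in seeds else 0.0) for n in graph}
        scores = nx.pagerank(graph, alpha=alpha, personalization=personalization)
        for node, score in scores.items():
            combined[node] += score  # combine per-topic trust into one score
    return combined
```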
IEEE Internet Computing | 2001
Brian D. Davison
The article provides a primer on Web resource caching, one technology used to make the Web scalable. Web caching can reduce bandwidth usage, decrease user-perceived latencies, and reduce Web server loads transparently. As a result, caching has become a significant part of the Web's infrastructure. Caching has even spawned a new industry: content delivery networks, which are also growing at a fantastic rate. Readers familiar with relatively advanced Web caching topics such as the Internet Cache Protocol (ICP), invalidation, and interception proxies are not likely to learn much here. Instead, the article is designed for the general audience of Web users. Rather than a how-to guide to caching technology deployment, it is a high-level argument for the value of Web caching to content consumers and producers. The article defines caching, explains how it applies to the Web, and describes when and why it is useful.
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2006
Lan Nie; Brian D. Davison; Xiaoguang Qi
Traditional web link-based ranking schemes use a single score to measure a page's authority without concern for the community from which that authority is derived. As a result, a resource that is highly popular for one topic may dominate the results of another topic in which it is less authoritative. To address this problem, we suggest calculating a score vector for each page to distinguish the contributions from different topics, using a random walk model that probabilistically combines page topic distribution and link structure. We show how to incorporate the topical model within both PageRank and HITS without affecting their overall properties, while still providing insight into topic-level transitions. Experiments on multiple datasets indicate that our technique outperforms other ranking approaches that incorporate textual analysis.
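A minimal sketch, not the paper's exact formulation: a per-topic power iteration in which the teleportation distribution is weighted by each page's probability of belonging to that topic, yielding a score vector per page (numpy assumed; the graph and topic distributions are toy data):

```python
# Hypothetical sketch: topic-specific score vectors from a biased random walk.
import numpy as np

def topical_scores(adjacency, topic_dist, alpha=0.85, iters=100):
    """adjacency: (n, n) 0/1 link matrix; topic_dist: (n, k) page-topic probabilities.
    Returns an (n, k) matrix of topic-specific authority scores."""
    n, k = topic_dist.shape
    out_deg = adjacency.sum(axis=1, keepdims=True)
    safe_deg = np.where(out_deg > 0, out_deg, 1)
    transition = adjacency / safe_deg  # row-stochastic; dangling rows stay zero
    scores = np.zeros((n, k))
    for t in range(k):
        teleport = topic_dist[:, t] / topic_dist[:, t].sum()
        r = np.full(n, 1.0 / n)
        for _ in range(iters):
            r = alpha * transition.T @ r + (1 - alpha) * teleport
        scores[:, t] = r
    return scores

# Toy usage: three pages, two topics.
links = np.array([[0, 1, 1],
                  [1, 0, 0],
                  [0, 1, 0]], dtype=float)
topics = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.5, 0.5]])
print(topical_scores(links, topics))
```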
ACM Conference on Hypertext | 2002
Brian D. Davison
Most proposed Web prefetching techniques make predictions based on the historical references to requested objects. In contrast, this paper examines the accuracy of predicting a user's next action based on analysis of the content of the pages requested recently by the user. Predictions are made using the similarity of a model of the user's interests to the text in and around the hypertext anchors of recently requested Web pages. This approach can make predictions of actions that have never been taken by the user and potentially make predictions that reflect current user interests. We evaluate this technique using data from a full-content log of Web activity and find that textual similarity-based predictions outperform simpler approaches.
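A hypothetical sketch of this content-based prediction step, assuming scikit-learn: anchor text of candidate links is scored by cosine similarity against a profile built from the user's recently requested pages, and the highest-scoring candidates would be prefetched (the URLs and text are illustrative):

```python
# Hypothetical sketch: score candidate links against a user-interest profile.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

recent_pages = [
    "schedule of upcoming machine learning conferences and deadlines",
    "call for papers on information retrieval and web mining",
]
candidate_anchors = {
    "http://example.org/cfp": "call for papers: web search workshop",
    "http://example.org/recipes": "quick weeknight dinner recipes",
}

vectorizer = TfidfVectorizer().fit(recent_pages + list(candidate_anchors.values()))
profile = vectorizer.transform([" ".join(recent_pages)])
for url, anchor in candidate_anchors.items():
    score = cosine_similarity(profile, vectorizer.transform([anchor]))[0, 0]
    print(f"{score:.3f}  {url}")  # prefetch the highest-scoring candidates
```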
Communications of The ACM | 2000
Haym Hirsh; Chumki Basu; Brian D. Davison
A key question in the design of such self-customizing software is what kind of patterns can be recognized by the learning algorithms. At one end, the system may do little more than recognize superficial patterns in a single user's interactions. At the other, the system may exploit deeper knowledge about the user, what tasks the user is performing, as well as information about what other users have previously done. The challenge becomes one of identifying what information is available for the given "learning to personalize" task and what methods are best suited to the available information. When I used the email program on my PC to forward the file of this article to the editor of this magazine, I executed a series of actions that are mostly the same ones I would take to forward any file to another user. I typically click on an item on a menu that pops up a window for the composition of an email message. A fairly routine sequence of actions then follows: I compose the message, select a menu item that creates a pop-up window into which I enter the name of the desired file to be forwarded, finally completing the …
International ACM SIGIR Conference on Research and Development in Information Retrieval | 2009
Liangjie Hong; Brian D. Davison
Discussion boards and online forums are important platforms for people to share information. Users post questions or problems onto discussion boards and rely on others to provide possible solutions and such question-related content sometimes even dominates the whole discussion board. However, to retrieve this kind of information automatically and effectively is still a non-trivial task. In addition, the existence of other types of information (e.g., announcements, plans, elaborations, etc.) makes it difficult to assume that every thread in a discussion board is about a question. We consider the problems of identifying question-related threads and their potential answers as classification tasks. Experimental results across multiple datasets demonstrate that our method can significantly improve the performance in both question detection and answer finding subtasks. We also do a careful comparison of how different types of features contribute to the final result and show that non-content features play a key role in improving overall performance. Finally, we show that a ranking scheme based on our classification approach can yield much better performance than prior published methods.
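A hypothetical sketch of the classification setup described above, assuming scikit-learn and scipy: bag-of-words content features are concatenated with simple non-content features such as thread position and punctuation before training a classifier (the features and data are illustrative, not the paper's exact feature set):

```python
# Hypothetical sketch: combine content and non-content features for
# question detection in forum threads.
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

posts = [
    {"text": "How do I reset my password?", "position": 0, "label": 1},
    {"text": "Here is the agenda for next week's meeting.", "position": 0, "label": 0},
    {"text": "Go to settings and click reset.", "position": 1, "label": 0},
]

texts = [p["text"] for p in posts]
content = TfidfVectorizer().fit_transform(texts)          # content features
non_content = csr_matrix([[p["position"], int("?" in p["text"])] for p in posts])
features = hstack([content, non_content])                 # concatenate both views

clf = LogisticRegression(max_iter=1000).fit(features, [p["label"] for p in posts])
```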
International World Wide Web Conferences | 2006
Baoning Wu; Brian D. Davison
By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
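A minimal sketch of a two-step pipeline in the spirit of this approach: a cheap filter flags pages whose crawler and browser copies differ substantially, and a separately trained classifier makes the final decision; all names, thresholds, and features here are illustrative assumptions:

```python
# Hypothetical sketch: filter-then-classify detection of semantic cloaking.
def filter_candidates(pages, min_unique_terms=20):
    """pages: iterable of (url, crawler_text, browser_text) tuples."""
    candidates = []
    for url, crawler_text, browser_text in pages:
        crawler_terms = set(crawler_text.lower().split())
        browser_terms = set(browser_text.lower().split())
        # Step 1: keep pages with many terms shown to the crawler but not the browser.
        if len(crawler_terms - browser_terms) >= min_unique_terms:
            candidates.append((url, crawler_terms, browser_terms))
    return candidates

def classify_candidates(candidates, classifier):
    """Step 2: a classifier trained elsewhere decides which candidates are cloaking."""
    features = [[len(c - b), len(b - c), len(c & b)] for _, c, b in candidates]
    return classifier.predict(features)
```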