Isaac G. Councill | Researchain

Publication

Featured research published by Isaac G. Councill.

acm/ieee joint conference on digital libraries | 2007

Efficient topic-based unsupervised name disambiguation

Yang Song; Jian Huang; Isaac G. Councill; Jia Li; C. Lee Giles

Name ambiguity is a special case of identity uncertainty where one person can be referenced by multiple name variations in different situations or even share the same name with other people. In this paper, we focus on the problem of disambiguating person names within web pages and scientific documents. We present an efficient and effective two-stage approach to disambiguate names. In the first stage, two novel topic-based models are proposed by extending two hierarchical Bayesian text models, namely Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA). Our models explicitly introduce a new variable for persons and learn the distribution of topics with regard to persons and words. After learning an initial model, the topic distributions are treated as feature sets and names are disambiguated by leveraging a hierarchical agglomerative clustering method. Experiments on web data and scientific documents from CiteSeer indicate that our approach consistently outperforms other unsupervised learning methods such as spectral clustering and DBSCAN clustering and could be extended to other research fields. We empirically addressed the issue of scalability by disambiguating authors in over 750,000 papers from the entire CiteSeer dataset.
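
The second stage described above, clustering name mentions by their topic distributions, can be illustrated with a minimal sketch. It does not implement the paper's extended PLSA/LDA models: the topic vectors below are invented for illustration, and the cosine distance, average linkage, and stopping threshold are assumptions rather than choices confirmed by the abstract.

```python
from itertools import combinations
from math import sqrt


def cosine_distance(p, q):
    """Return 1 - cosine similarity between two topic vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = sqrt(sum(a * a for a in p)) * sqrt(sum(b * b for b in q))
    return 1.0 - dot / norm


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering.

    Starts with one cluster per vector and repeatedly merges the two
    closest clusters until the closest pair is farther than `threshold`.
    Returns a list of clusters, each a sorted list of vector indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(a, b):
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        if linkage(clusters[x], clusters[y]) > threshold:
            break
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return [sorted(c) for c in clusters]


# Hypothetical topic distributions for four mentions of the same name string.
mentions = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.05, 0.10, 0.85],
    [0.10, 0.05, 0.85],
]
clusters = agglomerative_clusters(mentions, threshold=0.3)
```

In this toy input, mentions 0 and 1 share one dominant topic and mentions 2 and 3 share another, so the sketch groups them as two distinct persons. The threshold would need tuning on real data.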

international conference on data mining | 2007

Discovering Temporal Communities from Social Network Documents

Ding Zhou; Isaac G. Councill; Hongyuan Zha; C.L. Giles

This paper studies the discovery of communities from social network documents produced over time, addressing the discovery of temporal trends in community memberships. We first formulate static community discovery at a single time period as a tripartite graph partitioning problem. Then we propose to discover the temporal communities by threading the statically derived communities in different time periods using a new constrained partitioning algorithm, which partitions graphs based on topology as well as prior information regarding vertex membership. We evaluate the proposed approach on synthetic datasets and a real-world dataset prepared from CiteSeer.

international workshop on peer to peer systems | 2005

OverCite: a cooperative digital research library

Jeremy Stribling; Isaac G. Councill; Jinyang Li; M. Frans Kaashoek; David R. Karger; Robert Tappan Morris; Scott Shenker

CiteSeer is a well-known online resource for the computer science research community, allowing users to search and browse a large archive of research papers. Unfortunately, its current centralized incarnation is costly to run. Although members of the community would presumably be willing to donate hardware and bandwidth at their own sites to assist CiteSeer, the current architecture does not facilitate such distribution of resources. OverCite is a proposal for a new architecture for a distributed and cooperative research library based on a distributed hash table (DHT). The new architecture will harness resources at many sites, and thereby be able to support new features such as document alerts and scale to larger data sets.
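
As a rough illustration of how a distributed hash table can spread documents across cooperating sites, the sketch below uses a generic consistent-hashing ring. It is not OverCite's actual design, which the abstract does not detail. The `HashRing` class, the site host names, and the document key format are all hypothetical.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a position on the hash ring using SHA-1."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Minimal consistent-hashing ring: each key belongs to its successor node."""

    def __init__(self, nodes):
        self._ring = sorted((ring_position(n), n) for n in nodes)
        self._positions = [pos for pos, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First node clockwise from the key's position, wrapping around.
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._ring[idx % len(self._ring)][1]


# Hypothetical participating sites and document key.
sites = ["site-a.example.org", "site-b.example.org", "site-c.example.org"]
ring = HashRing(sites)
owner = ring.node_for("doc:overcite-2005")
```

A useful property of this scheme is that when a new site joins, the only keys that change owner are those that move to the new site, so existing sites do not reshuffle data among themselves.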

international world wide web conferences | 2006

CiteSeerx: an architecture and web service design for an academic document search engine

Huajing Li; Isaac G. Councill; Wang-Chien Lee; C. Lee Giles

CiteSeer is a scientific literature digital library and search engine which automatically crawls and indexes scientific documents in the field of computer and information science. After serving as a public search engine for nearly ten years, CiteSeer is starting to have scaling problems in handling more documents, adding new features, and serving more users. Its monolithic architecture design prevents it from effectively making use of new web technologies and providing new services. After analyzing the current system problems, we propose a new architecture and data model, CiteSeerx, that will overcome the existing problems as well as provide scalability and better performance plus new services and system features.

web intelligence | 2010

The Ethicality of Web Crawlers

Yang Sun; Isaac G. Councill; C. Lee Giles

Search engines largely rely on web crawlers to collect information from the web. This has led to an enormous amount of web traffic generated by crawlers alone. To minimize negative aspects of this traffic on websites, the behaviors of crawlers may be regulated at an individual web server by implementing the Robots Exclusion Protocol in a file called “robots.txt”. Although not an official standard, the Robots Exclusion Protocol has been adopted to a greater or lesser extent by nearly all commercial search engines and popular crawlers. As many web site administrators and policy makers have come to rely on the informal contract set forth by the Robots Exclusion Protocol, the degree to which web crawlers respect robots.txt policies has become an important issue of computer ethics. In this research, we investigate and define rules to measure crawler ethics, referring to the extent to which web crawlers respect the regulations set forth in robots.txt configuration files. We test the behaviors of web crawlers in terms of ethics by deploying a crawler honeypot: a set of websites where each site is configured with a distinct regulation specification using the Robots Exclusion Protocol in order to capture specific behaviors of web crawlers. We propose a vector space model to represent crawler behavior and a set of models to measure the ethics of web crawlers based on their behaviors. The results show that ethicality scores vary significantly among crawlers. Most commercial web crawlers receive fairly low ethicality violation scores, which means most of the crawlers’ behaviors are ethical; however, many commercial crawlers still consistently violate or misinterpret certain robots.txt rules.
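
To make the idea of checking crawler behavior against robots.txt concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. It computes a simple violation ratio per crawler and does not reproduce the paper's vector space model or ethicality scores. The robots.txt content, crawler names, requested paths, and the `violation_rate` helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for one honeypot site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""


def violation_rate(robots_txt, user_agent, requested_paths):
    """Return (fraction of disallowed requests, list of disallowed paths)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violations = [p for p in requested_paths if not parser.can_fetch(user_agent, p)]
    return len(violations) / len(requested_paths), violations


# Hypothetical access-log paths for one crawler.
log_paths = ["/index.html", "/private/a.html", "/papers/1.pdf", "/private/b.html"]
rate, disallowed = violation_rate(ROBOTS_TXT, "PoliteBot", log_paths)
```

A real measurement would also need to weigh other directives and how often each rule is violated, which a plain ratio ignores.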

european conference on research and advanced technology for digital libraries | 2005

A comparison of on-line computer science citation databases

Vaclav Petricek; Ingemar J. Cox; Hui Han; Isaac G. Councill; C. Lee Giles

This paper examines the differences and similarities between the two on-line computer science citation databases DBLP and CiteSeer. The database entries in DBLP are inserted manually while the CiteSeer entries are obtained autonomously via a crawl of the Web and automatic processing of user submissions. CiteSeer's autonomous citation database can be considered a form of self-selected on-line survey. It is important to understand the limitations of such databases, particularly when citation information is used to assess the performance of authors, institutions and funding bodies. We show that the CiteSeer database contains considerably fewer single author papers. This bias can be modeled by an exponential process with an intuitive explanation. The model permits us to predict that the DBLP database covers approximately 24% of the entire literature of Computer Science. CiteSeer is also biased against low-cited papers. Despite their differences, both databases exhibit similar citation distributions, which differ significantly from those reported in previous analysis of the Physics community. In both databases, we also observe that the number of authors per paper has been increasing over time.

international conference on knowledge capture | 2005

Automatic acknowledgement indexing: expanding the semantics of contribution in the CiteSeer digital library

Isaac G. Councill; C. Lee Giles; Hui Han; Eren Manavoglu

Acknowledgements in research publications, like citations, indicate influential contributions to scientific work; however, large-scale acknowledgement analyses have traditionally been impractical due to the high cost of manual information extraction. In this paper we describe a mixture method for automatically mining acknowledgements from research documents using a combination of a Support Vector Machine and regular expressions. The algorithm has been implemented as a plug-in to the CiteSeer Digital Library and the extraction results have been integrated with the traditional metadata and citation index of the CiteSeer system. As a demonstration, we use CiteSeer's autonomous citation indexing (ACI) feature to measure the relative impact of acknowledged entities, and present the top twenty acknowledged entities within the archive.

international conference on service oriented computing | 2004

A service-oriented architecture for digital libraries

Yves Petinot; C. Lee Giles; Vivek Bhatnagar; Pradeep B. Teregowda; Hui Han; Isaac G. Councill

CiteSeer is currently a very large source of meta-data information on the World Wide Web (WWW). This meta-data is the key material for the Semantic Web. Still, CiteSeer is not yet a Semantic-enabled service and therefore its meta-data, although potentially usable by Semantic Web agents, is not yet reachable using the Semantic Web mechanisms. The complexity of CiteSeer, that is, the range of tasks it supports, makes the transition to a Semantic-enabled service a non-trivial task. While human users tend to perceive CiteSeer as a single well-integrated service, we believe it is best seen, from a machine perspective, as a collection of services, each service performing a specific task. In this paper we show our approach to enable CiteSeer on the Semantic Web in order to allow the use of its meta-data through the Semantic Web. We first introduce an intuitive Application Programming Interface (API) to the CiteSeer software, then show that an efficient integration of CiteSeer in the Semantic Web can be best achieved by independently integrating the services that comprise it. We believe the effort presented here towards the Semantic-integration of a complex Information Retrieval system could be used as an integration model for arbitrary systems.

web information and data management | 2008

Personalized ranking for digital libraries based on log analysis

Yang Sun; Huajing Li; Isaac G. Councill; Jian Huang; Wang-Chien Lee; C. Lee Giles

Given the exponential increase of indexable context on the Web, ranking is an increasingly difficult problem in information retrieval systems. Recent research shows that implicit feedback regarding user preferences can be extracted from web access logs in order to increase ranking performance. We analyze the implicit user feedback from access logs in the CiteSeer academic search engine and show how site structure can better inform the analysis of clickthrough feedback, providing accurate personalized ranking services tailored to individual information retrieval systems. Experiments and analysis show that our proposed method is more accurate at predicting user preferences than non-personalized ranking methods when user preferences are stable over time. We compare our method with several non-personalized ranking methods including ranking SVMlight as well as several ranking functions specific to the academic document domain. The results show that our ranking algorithm can reach 63.59% accuracy in comparison to 50.02% for ranking SVMlight and below 43% for all other single feature ranking methods. We also show how the derived personalized ranking vectors can be employed for other ranking-related purposes such as recommendation systems.

Information Processing and Management | 2008

Design and evaluation of awareness mechanisms in CiteSeer

Umer Farooq; Craig H. Ganoe; John M. Carroll; Isaac G. Councill; C. Lee Giles

Awareness has been extensively studied in human computer interaction (HCI) and computer supported cooperative work (CSCW). The success of many collaborative systems hinges on effectively supporting awareness of different collaborators, their actions, and the process of creating shared work products. As digital libraries are increasingly becoming more than just repositories for information search and retrieval - essentially fostering collaboration among their community of users - awareness remains an unexplored research area in this domain. We are investigating awareness mechanisms in CiteSeer, a scholarly digital library for the computer and information science domain. CiteSeer users can be notified of new publication events (e.g., publication of a paper that cites one of their papers) using feeds as notification systems. We present three cumulative user studies - requirements elicitation, prototype evaluation, and naturalistic study - in the context of supporting CiteSeer feeds. Our results indicate that users prefer feeds that place target items in query-relevant contexts, and that preferred context varies with type of publication event. We found that users integrated feeds as part of their broader, everyday activities and used them as planning tools to collaborate with others.
