Gary William Flake | Researchain

Publication

Featured research published by Gary William Flake.

IEEE Computer | 2002

Self-organization and identification of Web communities

Gary William Flake; Steve Lawrence; C.L. Giles; Frans Coetzee

The vast improvement in information access is not the only advantage resulting from the increasing percentage of hyperlinked human knowledge available on the Web. Additionally, much potential exists for analyzing interests and relationships within science and society. However, the Web's decentralized and unorganized nature hampers content analysis. Millions of individuals operating independently and having a variety of backgrounds, knowledge, goals and cultures author the information on the Web. Despite the Web's decentralized, unorganized, and heterogeneous nature, our work shows that the Web self-organizes and its link structure allows efficient identification of communities. This self-organization is significant because no central authority or process governs the formation and structure of hyperlinks.

Knowledge Discovery and Data Mining | 2000

Efficient identification of Web communities

Gary William Flake; Steve Lawrence; C. Lee Giles

We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be efficiently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink consists of well-known non-members. A focused crawler that crawls to a fixed depth can approximate community membership by augmenting the graph induced by the crawl with links to a virtual sink node. The effectiveness of the approximation algorithm is demonstrated with several crawl results that identify hubs, authorities, web rings, and other link topologies that are useful but not easily categorized. Applications of our approach include focused crawlers and search engines, automatic population of portal categories, and improved filtering.
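The maximum flow / minimum cut idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact graph construction, so the details here are assumptions: every link gets unit capacity in both directions, a hypothetical virtual source `_source` feeds the known members, known non-members drain into a hypothetical virtual sink `_sink`, and the community is taken to be the source side of the resulting cut. The function name `min_cut_community` is illustrative only.

```python
from collections import defaultdict, deque

def min_cut_community(edges, seeds, non_members):
    """Return the nodes on the source side of a minimum s-t cut.

    Assumptions (not from the abstract): unit capacity per link in both
    directions, and effectively unbounded capacity on the virtual edges.
    """
    cap = defaultdict(lambda: defaultdict(int))
    for u, v in edges:
        cap[u][v] += 1  # links count in either direction
        cap[v][u] += 1
    S, T = "_source", "_sink"
    big = len(edges) + 1  # larger than any cut through real links
    for s in seeds:
        cap[S][s] += big
        cap[s][S] += 0  # make sure the reverse residual edge exists
    for t in non_members:
        cap[t][T] += big
        cap[T][t] += 0

    def reachable():
        # Breadth-first search over edges with remaining capacity.
        parent = {S: None}
        queue = deque([S])
        while queue:
            u = queue.popleft()
            for v, c in cap[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    # Edmonds-Karp: augment along shortest paths until the sink is cut off.
    while True:
        parent = reachable()
        if T not in parent:
            break
        path, v = [], T
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        flow = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= flow
            cap[w][u] += flow
    return set(parent) - {S}

# Two triangles joined by a single bridge link (c, x).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("x", "y"), ("y", "z"), ("x", "z"), ("c", "x")]
print(min_cut_community(edges, {"a"}, {"z"}))
```

With one seed in the first triangle and one known non-member in the second, the cheapest cut is the single bridge link, so the seed's triangle comes back as the community.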

Proceedings of the National Academy of Sciences of the United States of America | 2002

Winners don't take all: Characterizing the competition for links on the web

David M. Pennock; Gary William Flake; Steve Lawrence; Eric J. Glover; C. Lee Giles

As a whole, the World Wide Web displays a striking “rich get richer” behavior, with a relatively small number of sites receiving a disproportionately large share of hyperlink references and traffic. However, hidden in this skewed global distribution, we discover a qualitatively different and considerably less biased link distribution among subcategories of pages—for example, among all university homepages or all newspaper homepages. Although the connectivity distribution over the entire web is close to a pure power law, we find that the distribution within specific categories is typically unimodal on a log scale, with the location of the mode, and thus the extent of the rich get richer phenomenon, varying across different categories. Similar distributions occur in many other naturally occurring networks, including research paper citations, movie actor collaborations, and United States power grid connections. A simple generative model, incorporating a mixture of preferential and uniform attachment, quantifies the degree to which the rich nodes grow richer, and how new (and poorly connected) nodes can compete. The model accurately accounts for the true connectivity distributions of category-specific web pages, the web as a whole, and other social networks.
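A mixture of preferential and uniform attachment can be sketched as a small simulation. This is an illustration under stated assumptions, not the paper's exact model: each new node adds a single link, the mixing weight is a hypothetical parameter `alpha`, and the network starts from two linked nodes. The function name `grow_network` is illustrative.

```python
import random

def grow_network(n_nodes, alpha, seed=0):
    """Grow a network one node at a time and return the degree of each node.

    With probability alpha the new node links to an existing node chosen in
    proportion to its degree (preferential attachment); otherwise it links to
    an existing node chosen uniformly at random.
    """
    rng = random.Random(seed)
    degree = [1, 1]     # start from two nodes joined by one link
    endpoints = [0, 1]  # each node appears once per link end it owns
    for new in range(2, n_nodes):
        if rng.random() < alpha:
            target = rng.choice(endpoints)  # proportional to degree
        else:
            target = rng.randrange(new)     # uniform over existing nodes
        degree.append(1)
        degree[target] += 1
        endpoints.extend((new, target))
    return degree

degrees = grow_network(1000, alpha=0.5)
print(max(degrees), sum(degrees) / len(degrees))
```

Setting `alpha` near 1 approaches pure preferential attachment, while `alpha` near 0 gives every existing node the same chance of receiving the new link, which is how poorly connected nodes get room to compete.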

International World Wide Web Conferences | 2002

Using web structure for classifying and describing web pages

Eric J. Glover; Kostas Tsioutsiouliklis; Steve Lawrence; David M. Pennock; Gary William Flake

The structure of the web is increasingly being used to improve organization, search, and analysis of information on the web. For example, Google uses the text in citing documents (documents that link to the target document) for search. We analyze the relative utility of document text, and the text in citing documents near the citation, for classification and description. Results show that the text in citing documents, when available, often has greater discriminative and descriptive power than the text in the target document itself. The combination of evidence from a document and citing documents can improve on either information source alone. Moreover, by ranking words and phrases in the citing documents according to expected entropy loss, we are able to accurately name clusters of web pages, even with very few positive examples. Our results confirm, quantify, and extend previous research using web structure in these areas, introducing new methods for classification and description of pages.
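Ranking words by expected entropy loss can be sketched as follows. The sketch assumes the standard information-gain formulation for a binary feature (term present or absent) over a positive and a negative document set. The abstract does not spell out the formula, so treat this as an assumption. The function names and the counts in the example are hypothetical.

```python
import math

def entropy(p):
    """Binary entropy in bits of a class with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_entropy_loss(pos_with, pos_total, neg_with, neg_total):
    """Prior class entropy minus the expected entropy after observing a term.

    pos_with / neg_with: positive / negative documents containing the term.
    """
    total = pos_total + neg_total
    with_f = pos_with + neg_with
    without_f = total - with_f
    prior = entropy(pos_total / total)
    h_with = entropy(pos_with / with_f) if with_f else 0.0
    h_without = entropy((pos_total - pos_with) / without_f) if without_f else 0.0
    return prior - (with_f / total * h_with + without_f / total * h_without)

# Hypothetical term counts: (positives containing term, negatives containing term)
counts = {"syllabus": (9, 1), "the": (10, 10), "homework": (6, 2)}
ranked = sorted(counts,
                key=lambda w: expected_entropy_loss(counts[w][0], 10, counts[w][1], 10),
                reverse=True)
print(ranked)
```

A term that appears equally often in both sets scores zero, while a term concentrated in the positive set scores high, so the top of the ranking is a candidate label for the cluster.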

Internet Mathematics | 2004

Graph Clustering and Minimum Cut Trees

Gary William Flake; Robert Endre Tarjan; Kostas Tsioutsiouliklis

In this paper, we introduce simple graph clustering methods based on minimum cuts within the graph. The clustering methods are general enough to apply to any kind of graph but are well suited for graphs where the link structure implies a notion of reference, similarity, or endorsement, such as web and citation graphs. We show that the quality of the produced clusters is bounded by strong minimum cut and expansion criteria. We also develop a framework for hierarchical clustering and present applications to real-world data. We conclude that the clustering algorithms satisfy strong theoretical criteria and perform well in practice.

Machine Learning | 2002

Efficient SVM Regression Training with SMO

Gary William Flake; Steve Lawrence

The sequential minimal optimization algorithm (SMO) has been shown to be an effective method for training support vector machines (SVMs) on classification tasks defined on sparse data sets. SMO differs from most SVM algorithms in that it does not require a quadratic programming solver. In this work, we generalize SMO so that it can handle regression problems. However, one problem with SMO is that its rate of convergence slows down dramatically when data is non-sparse and when there are many support vectors in the solution (as is often the case in regression), because kernel function evaluations tend to dominate the runtime in this case. Moreover, caching kernel function outputs can easily degrade SMO's performance even more because SMO tends to access kernel function outputs in an unstructured manner. We address these problems with several modifications that enable caching to be effectively used with SMO. For regression problems, our modifications improve convergence time by over an order of magnitude.

IEEE Computer | 2001

Persistence of Web references in scientific research

Steve Lawrence; David M. Pennock; Gary William Flake; R. Krovetz; Frans Coetzee; Eric J. Glover; Finn Årup Nielsen; A. Kruger; C.L. Giles

The lack of persistence of Web references has called into question the increasingly common practice of citing URLs in scientific papers. It is argued that although few critical resources have been lost to date, new strategies to manage Internet resources and improved citation practices are necessary to minimize the future loss of information.

Symposium on Applications and the Internet | 2001

Improving category specific Web search by learning query modifications

Eric J. Glover; Gary William Flake; Steve Lawrence; W.P. Birmingham; A. Kruger; C.L. Giles; D.M. Pennock

Users looking for documents within specific categories may have a difficult time locating valuable documents using general purpose search engines. We present an automated method for learning query modifications that can dramatically improve precision for locating pages within specified categories using Web search engines. We also present a classification procedure that can recognize pages in a specific category with high precision, using textual content, text location and HTML structure. Evaluation shows that the approach is highly effective for locating personal homepages and calls for papers. These algorithms are used to improve category specific search in the Inquirus 2 search engine.

Neural Networks for Signal Processing XI: Proceedings of the 2001 IEEE Signal Processing Society Workshop (IEEE Cat. No.01TH8584) | 2001

Artist detection in music with Minnowmatch

Brian Whitman; Gary William Flake; Steve Lawrence

In this paper we demonstrate the artist detection component of Minnowmatch, a machine listening and music retrieval engine. Minnowmatch (Mima) automatically determines various meta-data and makes classifications concerning a piece of audio using neural networks and support vector machines. The technologies developed in Minnowmatch may be used to create audio information retrieval systems, copyright protection devices, and recommendation agents. This paper concentrates on the artist or source detection component of Mima, which we show to classify a one-in-n artist space correctly 91% over a small song-set and 70% over a larger song set. We show that scaling problems using only neural networks for classification can be addressed with a pre-classification step of multiple support vector machines.

Conference on Information and Knowledge Management | 2000

DEADLINER: building a new niche search engine

A. Kruger; C.L. Giles; Frans Coetzee; Eric J. Glover; Gary William Flake; Steve Lawrence; C. Omlin

We present DEADLINER, a search engine that catalogs conference and workshop announcements, and ultimately will monitor and extract a wide range of academic convocation material from the web. The system currently extracts speakers, locations, dates, paper submission (and other) deadlines, topics, program committees, abstracts, and affiliations. A user or user agent can perform detailed searches on these fields. DEADLINER was constructed using a methodology for rapid implementation of specialized search engines. This methodology avoids complex hand-tuned text extraction solutions, or natural language processing, by Bayesian integration of simple extractors that exploit loose formatting and keyword conventions. The Bayesian framework further produces a search engine where each user can control the false alarm rate on a field in an intuitive yet rigorous fashion.
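One way to picture Bayesian integration of simple extractors is a naive Bayes combination of binary detectors. The abstract does not describe DEADLINER's actual formulation, so everything here is an assumption made for illustration: detectors are treated as conditionally independent, each is summarized by a hypothetical hit rate and false alarm rate, and the function name `posterior` is illustrative.

```python
def posterior(prior, detectors, outputs):
    """Posterior probability that a text span is the target field.

    prior: probability of the field before any detector is consulted.
    detectors: list of (hit_rate, false_alarm_rate) pairs, one per detector.
    outputs: list of booleans, True where the matching detector fired.
    Assumes detectors are conditionally independent given the true label.
    """
    like_pos, like_neg = prior, 1.0 - prior
    for (hit, false_alarm), fired in zip(detectors, outputs):
        like_pos *= hit if fired else 1.0 - hit
        like_neg *= false_alarm if fired else 1.0 - false_alarm
    return like_pos / (like_pos + like_neg)

# Two hypothetical detectors, e.g. a keyword match and a date-pattern match.
detectors = [(0.9, 0.1), (0.7, 0.2)]
score = posterior(0.5, detectors, [True, True])
print(score)
```

Under this sketch, a user-facing threshold on the posterior trades missed fields against false alarms: raising it accepts fewer spans and lowers the false alarm rate.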
