W. Scott Spangler | Researchain

Publication

Featured research published by W. Scott Spangler.

Machine Learning | 2003

Feature Weighting in k-Means Clustering

Dharmendra S. Modha; W. Scott Spangler

Data sets with multiple, heterogeneous feature spaces occur frequently. We present an abstract framework for integrating multiple feature spaces in the k-means clustering algorithm. Our main ideas are (i) to represent each data object as a tuple of multiple feature vectors, (ii) to assign a suitable (and possibly different) distortion measure to each feature space, (iii) to combine distortions on different feature spaces, in a convex fashion, by assigning (possibly) different relative weights to each, (iv) for a fixed weighting, to cluster using the proposed convex k-means algorithm, and (v) to determine the optimal feature weighting to be the one that yields the clustering that simultaneously minimizes the average within-cluster dispersion and maximizes the average between-cluster dispersion along all the feature spaces. Using precision/recall evaluations and known ground truth classifications, we empirically demonstrate the effectiveness of feature weighting in clustering on several different application domains.
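Steps (i) through (iii) of the abstract can be illustrated with a minimal sketch. It assumes two feature spaces, a cosine-based distortion for the first and squared Euclidean distortion for the second; these choices, and all function names, are illustrative assumptions rather than the paper's definitions.

```python
import math

def squared_euclidean(x, c):
    """Squared Euclidean distortion between a vector and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def cosine_distortion(x, c):
    """One minus cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, c))
    nx = math.sqrt(sum(a * a for a in x))
    nc = math.sqrt(sum(b * b for b in c))
    if nx == 0 or nc == 0:
        return 1.0
    return 1.0 - dot / (nx * nc)

def combined_distortion(obj, centroid, measures, weights):
    """Convex combination of per-feature-space distortions.

    obj and centroid are tuples of feature vectors, one per feature space;
    weights must be non-negative and sum to 1.
    """
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * d(x, c)
               for w, d, x, c in zip(weights, measures, obj, centroid))

def assign(obj, centroids, measures, weights):
    """Index of the centroid with the smallest combined distortion."""
    return min(range(len(centroids)),
               key=lambda j: combined_distortion(obj, centroids[j],
                                                 measures, weights))

# Hypothetical data: each object is a (direction-like, numeric) pair of vectors.
measures = (cosine_distortion, squared_euclidean)
weights = (0.5, 0.5)
obj = ([1.0, 0.0], [0.0, 2.0])
centroids = [([1.0, 0.0], [0.0, 0.0]),
             ([0.0, 1.0], [0.0, 2.0])]
nearest = assign(obj, centroids, measures, weights)
```

Changing `weights` shifts which feature space dominates the assignment, which is the quantity the abstract proposes to optimize in step (v).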

ACM Conference on Hypertext | 2000

Clustering hypertext with applications to web searching

Dharmendra S. Modha; W. Scott Spangler

A method and structure of searching a database containing hypertext documents comprising searching the database using a query to produce a set of hypertext documents; and geometrically clustering the set of hypertext documents into various clusters using a toric k-means similarity measure such that documents within each cluster are similar to each other, wherein the clustering has a linear-time complexity in producing the set of hypertext documents, wherein the similarity measure comprises a weighted sum of maximized individual components of the set of hypertext documents, and wherein the clustering is based upon words contained in each hypertext document, out-links from each hypertext document, and in-links to each hypertext document.
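As a rough illustration of a similarity built from words, out-links, and in-links, the sketch below scores a document against a cluster representative as a weighted sum of three cosine similarities over sparse vectors. The dictionary layout, the component names, and the weights are hypothetical; the text above does not specify them.

```python
import math

def normalize(vec):
    """Scale a sparse vector (dict of feature -> value) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}

def dot(a, b):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def hypertext_similarity(doc, concept, weights):
    """Weighted sum of cosine similarities, one per component."""
    return sum(w * dot(normalize(doc[part]), normalize(concept[part]))
               for part, w in weights.items())

# Hypothetical component weights (non-negative, summing to 1).
weights = {"words": 0.5, "out_links": 0.3, "in_links": 0.2}

page_a = {"words": {"cluster": 1, "web": 1},
          "out_links": {"a.example": 1},
          "in_links": {"b.example": 1}}
page_b = {"words": {"patent": 2},
          "out_links": {"a.example": 3},
          "in_links": {"c.example": 1}}

same = hypertext_similarity(page_a, page_a, weights)
different = hypertext_similarity(page_a, page_b, weights)
```

Here `page_a` and `page_b` share only an out-link, so their score comes entirely from the `out_links` weight.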

Computational Statistics & Data Analysis | 2002

Class visualization of high-dimensional data with applications

Inderjit S. Dhillon; Dharmendra S. Modha; W. Scott Spangler

The problem of visualizing high-dimensional data that has been categorized into various classes is considered. The goal in visualizing is to quickly absorb inter-class and intra-class relationships. Towards this end, class-preserving projections of the multidimensional data onto two-dimensional planes, which can be displayed on a computer screen, are introduced. These class-preserving projections maintain the high-dimensional class structure, and are closely related to Fisher's linear discriminants. By displaying sequences of such two-dimensional projections and by moving continuously from one projection to the next, an illusion of smooth motion through a multidimensional display can be created. Such sequences are called class tours. Furthermore, class-similarity graphs are overlaid on the two-dimensional projections to capture the distance relationships in the original high-dimensional space. The above visualization tools are illustrated on the classical Iris plant data, the ISOLET spoken letter data, and the PENDIGITS on-line handwriting data set. It is shown how the visual examination of the data can uncover latent class relationships.
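One simple way to get a two-dimensional view that keeps three classes apart is to project onto the plane passing through their three centroids, which preserves the distances between those centroids exactly. The sketch below builds such a plane with Gram-Schmidt orthogonalization; treating this as the abstract's "class-preserving projection" is an assumption, and the helper names are hypothetical.

```python
import math

def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale(a, s):
    return [x * s for x in a]

def centroid_plane_basis(m1, m2, m3):
    """Orthonormal basis of the plane through three centroids.

    Raises ZeroDivisionError if the centroids coincide or are collinear.
    """
    u = sub(m2, m1)
    u = scale(u, 1.0 / math.sqrt(dot(u, u)))
    v = sub(m3, m1)
    v = sub(v, scale(u, dot(v, u)))  # remove the component along u
    v = scale(v, 1.0 / math.sqrt(dot(v, v)))
    return u, v

def project(x, origin, basis):
    """2-D coordinates of x in the plane anchored at origin."""
    d = sub(x, origin)
    return tuple(dot(d, b) for b in basis)

# Hypothetical 4-D data, two points per class.
class_a = [[-1.0, 0.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]]
class_b = [[2.0, -1.0, 0.0, 0.0], [2.0, 1.0, 0.0, 0.0]]
class_c = [[0.0, 3.0, -1.0, 5.0], [0.0, 3.0, 1.0, 5.0]]
m1, m2, m3 = mean(class_a), mean(class_b), mean(class_c)
basis = centroid_plane_basis(m1, m2, m3)
p2 = project(m2, m1, basis)
p3 = project(m3, m1, basis)
```

Individual data points are projected with the same `project` call, so each class is drawn around its centroid's position in the plane.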

Journal of Management Information Systems | 2003

Generating and Browsing Multiple Taxonomies Over a Document Collection

W. Scott Spangler; Jeffrey Thomas Kreulen; Justin Lessler

We present a novel system and methodology for generating and then browsing multiple taxonomies over a document collection. Taxonomies are generated using a broad set of capabilities, including metadata, keyword queries, and automated clustering techniques that serve as a seed taxonomy. The taxonomy editor, eClassifier, provides powerful tools to visualize and edit each taxonomy to make it reflective of the desired theme. Cluster validation tools allow the editor to verify that documents received in the future can be automatically classified into each taxonomy with sufficiently high accuracy. In general, those seeking knowledge from a document collection may have only a vague notion of exactly what they are attempting to understand, and would like to explore related topics and concepts rather than simply being given a set of documents. For this purpose, we have developed MindMap, an interface utilizing multiple taxonomies and the ability to interact with a document collection.

International World Wide Web Conferences | 2010

Topic initiator detection on the world wide web

Xin Jin; W. Scott Spangler; Rui Ma; Jiawei Han

In this paper we introduce a new Web mining and search technique: Topic Initiator Detection (TID) on the Web. Given a topic query on the Internet and the resulting collection of time-stamped web documents which contain the query keywords, the task of TID is to automatically return which web document (or its author) initiated the topic or was the first to discuss it. To deal with the TID problem, we design a system framework and propose the InitRank (Initiator Ranking) algorithm to rank the web documents by their likelihood of being the topic initiator. We first extract features from the web documents and design several topic initiator indicators. Then, we propose a TCL graph which integrates the Time, Content and Link information and design an optimization framework over the graph to compute InitRank. Experiments show that, compared with baseline methods such as direct time sorting and the well-known link-based ranking algorithms PageRank and HITS, InitRank achieves the best overall performance with high effectiveness and robustness. In case studies, we successfully detected (1) the first web document related to a famous rumor of an Australian product banned in the USA and (2) the pre-release of the IBM and Google Cloud Computing collaboration before the official announcement.

Knowledge Discovery and Data Mining | 2014

Automated hypothesis generation based on mining scientific literature

W. Scott Spangler; Angela D. Wilkins; Benjamin J. Bachman; Meena Nagarajan; Tajhal Dayaram; Peter J. Haas; Sam Regenbogen; Curtis R. Pickering; Austin Comer; Jeffrey N. Myers; Ioana Stanoi; Linda Kato; Ana Lelescu; Jacques Joseph Labrie; Neha Parikh; Andreas Martin Lisewski; Lawrence A. Donehower; Ying Chen; Olivier Lichtarge

Keeping up with the ever-expanding flow of data and publications is untenable and poses a fundamental bottleneck to scientific progress. Current search technologies typically find many relevant documents, but they do not extract and organize the information content of these documents or suggest new scientific hypotheses based on this organized content. We present an initial case study on KnIT, a prototype system that mines the information contained in the scientific literature, represents it explicitly in a queriable network, and then further reasons upon these data to generate novel and experimentally testable hypotheses. KnIT combines entity detection with neighbor-text feature analysis and with graph-based diffusion of information to identify potential new properties of entities that are strongly implied by existing relationships. We discuss a successful application of our approach that mines the published literature to identify new protein kinases that phosphorylate the protein tumor suppressor p53. Retrospective analysis demonstrates the accuracy of this approach and ongoing laboratory experiments suggest that kinases identified by our system may indeed phosphorylate p53. These results establish proof of principle for automated hypothesis generation and discovery based on text mining of the scientific literature.

Conference on Information and Knowledge Management | 2002

Interactive methods for taxonomy editing and validation

W. Scott Spangler; Jeffrey Thomas Kreulen

Taxonomies are meaningful hierarchical categorizations of documents into topics reflecting the natural relationships between the documents and their business objectives. Improving the quality of these taxonomies and reducing the overall cost required to create them is an important area of research. Supervised and unsupervised text clustering are important technologies that comprise only a part of a complete solution. However, there exists a great need for the ability for a human to efficiently interact with a taxonomy during the editing and validation phase. We have developed a comprehensive approach to solving this problem, and implemented this approach in a software tool called eClassifier. eClassifier provides features to help the taxonomy editor understand and evaluate each category of a taxonomy and visualize the relationships between the categories. Multiple techniques allow the user to make changes at both the category and document level. Metrics then establish how well the resultant taxonomy can be modeled for future document classification. In this paper, we present a comprehensive set of viewing, editing and validation techniques we have implemented in the Lotus Discovery Server resulting in a significant reduction in the time required to create a quality taxonomy.

International Conference on Data Mining | 2011

Patent Maintenance Recommendation with Patent Information Network Model

Xin Jin; W. Scott Spangler; Ying Chen; Keke Cai; Rui Ma; Li Zhang; Xian Wu; Jiawei Han

Patents are of crucial importance for businesses, because they provide legal protection for the invented techniques, processes or products. A patent can be held for up to 20 years. However, large maintenance fees need to be paid to keep it enforceable. If the patent is deemed not valuable, the owner may decide to abandon it by ceasing to pay the maintenance fees, thereby reducing cost. For large companies or organizations, making such decisions is difficult because too many patents need to be investigated. In this paper, we introduce the new patent mining problem of automatic patent maintenance prediction, and propose a systematic solution to analyze patents for recommending patent maintenance decisions. We model the patents as a heterogeneous time-evolving information network and propose new patent features to build a model for a ranked prediction on whether to maintain or abandon a patent. In addition, a network-based refinement approach is proposed to further improve the performance. We have conducted experiments on the large-scale United States Patent and Trademark Office (USPTO) database, which contains over four million granted patents. The results show that our technique can achieve high performance.

Hawaii International Conference on System Sciences | 2002

MindMap: utilizing multiple taxonomies and visualization to understand a document collection

W. Scott Spangler; Jeffrey Thomas Kreulen; Justin Lessler

We present a novel system and methodology for browsing and exploring topics and concepts within a document collection. The process begins with the generation of multiple taxonomies from the document collection, each having a unique theme. We have developed the MindMap interface to the document collection. Starting from an initial keyword query, the MindMap interface helps the user to explore the concept space by first presenting the user with related terms and high-level topics in a radial graph. After refining the query by selecting any related terms, one of the related high-level concepts can be selected for further investigation. The MindMap uses a novel binary tree interface to explore the composition of a concept based on the presence or absence of terms. From the binary tree a concept can be further explored and visualized. Individual documents are presented as spatial coordinates where distance between points relates to document similarity. As the user browses this spatial representation, text is presented from the document that is most relevant to the user's initial query. Individual points can be selected to pull up the relevant paragraphs from the document with the keywords highlighted. Finally, selected documents are displayed and the user is allowed to further interact and investigate.

Web Intelligence and Agent Systems: An International Journal | 2010

Leveraging sentiment analysis for topic detection

Keke Cai; W. Scott Spangler; Ying Chen; Li Zhang

The emergence of new social media such as blogs, message boards, news, and web content in general has dramatically changed the ecosystems of corporations. Consumers, non-profit organizations, and other forms of communities are extremely vocal on the web about their opinions and perceptions of companies and their brands. The ability to leverage such “voice of the web” to gain consumer, brand, and market insights can be truly differentiating and valuable to today's corporations. In particular, one important form of insight can be derived from sentiment analysis on web content. Sentiment analysis traditionally emphasizes classification of web comments into positive, neutral, and negative categories. This paper goes beyond sentiment classification by focusing on techniques that could detect the topics that are highly correlated with the positive and negative opinions. Such techniques, when coupled with sentiment classification, can help business analysts to understand both the overall sentiment scope and the drivers behind the sentiment. In this paper, we describe our overall sentiment analysis system that consists of such sentiment analysis techniques, including the bootstrapping method for word polarity weighting, automatic filtering and expansion for domain words, and a sentiment classification method. We then detail a novel topic detection method using point-wise mutual information and term frequency distribution. We demonstrate the effectiveness of our overall approach via several case studies on different social media data sets.
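The point-wise mutual information (PMI) ingredient mentioned above can be sketched as follows. The example scores each term by how strongly its presence in a comment is associated with one sentiment label, using document-level counts. It covers only the PMI part, not the term frequency distribution, and the data layout and function name are assumptions for illustration.

```python
import math
from collections import Counter

def pmi_by_sentiment(docs, label):
    """PMI (base 2) between each term and a sentiment label.

    docs is a list of (terms, sentiment) pairs. Probabilities are estimated
    from document counts: PMI = log2(p(term, label) / (p(term) * p(label))).
    Terms that never co-occur with the label are omitted.
    """
    n = len(docs)
    n_label = sum(1 for _, s in docs if s == label)
    term_count = Counter()
    joint = Counter()
    for terms, s in docs:
        for t in set(terms):
            term_count[t] += 1
            if s == label:
                joint[t] += 1
    return {t: math.log2((joint[t] * n) / (term_count[t] * n_label))
            for t in joint}

# Hypothetical labeled comments, already tokenized.
docs = [
    ({"battery", "dies"}, "neg"),
    ({"battery", "great"}, "pos"),
    ({"screen", "great"}, "pos"),
    ({"battery", "weak"}, "neg"),
]
scores = pmi_by_sentiment(docs, "neg")
ranked = sorted(scores, key=scores.get, reverse=True)
```

Terms that appear only in negative comments score highest, while a term such as `battery` that appears under both labels scores lower. On real data, rare terms would need smoothing or a minimum-count filter, since PMI tends to overrate them.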
