Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Eui-Hong Han is active.

Publication


Featured research published by Eui-Hong Han.


IEEE Computer | 1999

Chameleon: hierarchical clustering using dynamic modeling

George Karypis; Eui-Hong Han; Vipin Kumar

Clustering is a discovery process in data mining. It groups a set of data in a way that maximizes the similarity within clusters and minimizes the similarity between two different clusters. Many advanced algorithms have difficulty dealing with highly variable clusters that do not follow a preconceived model. By basing its selections on both interconnectivity and closeness, the Chameleon algorithm yields accurate results for these highly variable clusters. Existing algorithms use a static model of the clusters and do not use information about the nature of individual clusters as they are merged. Furthermore, one set of schemes (the CURE algorithm and related schemes) ignores information about the aggregate interconnectivity of items in two clusters. Another set of schemes (the ROCK algorithm, the group averaging method, and related schemes) ignores information about the closeness of two clusters as defined by the similarity of the closest items across the two clusters. By considering either interconnectivity or closeness only, these algorithms can select and merge the wrong pair of clusters. Chameleon's key feature is that it accounts for both interconnectivity and closeness in identifying the most similar pair of clusters. Chameleon finds the clusters in the data set by using a two-phase algorithm. During the first phase, Chameleon uses a graph-partitioning algorithm to cluster the data items into several relatively small subclusters. During the second phase, it uses an algorithm to find the genuine clusters by repeatedly combining these subclusters.
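The merge criterion described in the abstract can be sketched in a few lines. This is an illustrative toy, not the authors' implementation; in particular the product-with-exponent score is an assumption standing in for the paper's exact formula.

```python
# Toy sketch of Chameleon's second phase: among candidate cluster pairs,
# merge the one that is strong on BOTH relative interconnectivity (RI)
# and relative closeness (RC). The score below is an illustrative
# assumption, not the paper's exact formula.

def merge_score(ri, rc, alpha=1.0):
    """Combined score; alpha trades off closeness against interconnectivity."""
    return ri * rc ** alpha

def best_pair(pairs):
    """pairs: dict mapping (cluster_i, cluster_j) -> (RI, RC)."""
    return max(pairs, key=lambda p: merge_score(*pairs[p]))

pairs = {
    ("A", "B"): (0.9, 0.2),  # well connected in aggregate, but far apart
    ("A", "C"): (0.6, 0.7),  # reasonably strong on both criteria
    ("B", "C"): (0.1, 0.9),  # very close items, weak aggregate links
}
print(best_pair(pairs))  # -> ('A', 'C')
```

A scheme looking only at interconnectivity would merge ("A", "B"), and one looking only at closeness would merge ("B", "C"); combining both criteria picks the balanced pair.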


international conference on management of data | 1997

Scalable parallel data mining for association rules

Eui-Hong Han; George Karypis; Vipin Kumar

One of the important problems in data mining is discovering association rules from databases of transactions, where each transaction consists of a set of items. The most time-consuming operation in this discovery process is the computation of the frequency of occurrence of interesting subsets of items (called candidates) in the database of transactions. To prune the exponentially large space of candidates, most existing algorithms consider only those candidates that have a user-defined minimum support. Even with this pruning, the task of finding all association rules requires a lot of computation power and time. Parallel computers offer a potential solution to the computation requirements of this task, provided efficient and scalable parallel algorithms can be designed. In this paper, we present two new parallel algorithms for mining association rules. The Intelligent Data Distribution algorithm efficiently uses the aggregate memory of the parallel computer by employing an intelligent candidate-partitioning scheme, and it uses an efficient communication mechanism to move data among the processors. The Hybrid Distribution algorithm further improves upon the Intelligent Data Distribution algorithm by dynamically partitioning the candidate set to maintain good load balance. Experimental results on a Cray T3D parallel computer show that the Hybrid Distribution algorithm scales linearly, exploits the aggregate memory better, and can generate more association rules with a single scan of the database per pass.
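The serial kernel that both parallel algorithms distribute is candidate support counting followed by minimum-support pruning. A minimal sketch of that kernel (counting all k-subsets directly, rather than the level-wise candidate-generation loop of a full Apriori-style implementation):

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(transactions, min_support, k):
    """Count the support of every k-item candidate and keep those that
    meet the user-defined minimum support."""
    counts = Counter()
    for t in transactions:
        for cand in combinations(sorted(t), k):
            counts[cand] += 1
    return {c: n for c, n in counts.items() if n >= min_support}

transactions = [{"bread", "milk"}, {"bread", "butter", "milk"},
                {"bread", "butter"}, {"butter", "milk"}]
print(frequent_itemsets(transactions, min_support=2, k=2))
# each of the three item pairs appears in exactly 2 transactions
```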


pacific asia conference on knowledge discovery and data mining | 2001

Text Categorization Using Weight Adjusted k-Nearest Neighbor Classification

Eui-Hong Han; George Karypis; Vipin Kumar

Text categorization presents unique challenges due to the large number of attributes in the data set, the large number of training samples, attribute dependency, and multi-modality of categories. Existing classification techniques have limited applicability to data sets of this nature. In this paper, we present a Weight Adjusted k-Nearest Neighbor (WAKNN) classification scheme that learns feature weights based on a greedy hill-climbing technique. We also present two performance optimizations of WAKNN that improve the computational performance by a few orders of magnitude without compromising classification quality. We experimentally evaluated WAKNN on 52 document data sets from a variety of domains and compared its performance against several classification algorithms, such as C4.5, RIPPER, Naive-Bayesian, PEBLS, and VSM. Experimental results on these data sets confirm that WAKNN consistently outperforms the other classification algorithms.
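The core of such a classifier is a cosine similarity in which each dimension is scaled by a learned weight. A minimal sketch, assuming tiny dense vectors; the greedy hill-climbing weight updates and the paper's performance optimizations are omitted:

```python
import math

def weighted_cosine(x, y, w):
    """Cosine similarity after scaling feature i of both vectors by w[i]."""
    num = sum((wi * xi) * (wi * yi) for wi, xi, yi in zip(w, x, y))
    nx = math.sqrt(sum((wi * xi) ** 2 for wi, xi in zip(w, x)))
    ny = math.sqrt(sum((wi * yi) ** 2 for wi, yi in zip(w, y)))
    return num / (nx * ny) if nx and ny else 0.0

def knn_predict(query, train, w, k=3):
    """Majority label among the k training documents most similar to query."""
    ranked = sorted(train, key=lambda d: weighted_cosine(query, d[0], w),
                    reverse=True)
    labels = [label for _, label in ranked[:k]]
    return max(set(labels), key=labels.count)

train = [((1.0, 0.0), "sports"), ((0.9, 0.1), "sports"),
         ((0.0, 1.0), "politics"), ((0.1, 0.9), "politics")]
print(knn_predict((0.8, 0.2), train, w=(1.0, 1.0)))  # -> sports
```

WAKNN's learning step would repeatedly perturb `w` and keep changes that improve classification accuracy on the training set.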


decision support systems | 1999

Partitioning-based clustering for Web document categorization

Daniel Boley; Maria L. Gini; Robert A. Gross; Eui-Hong Han; George Karypis; Vipin Kumar; Bamshad Mobasher; Jerome Moore; Kyle Hastings

Clustering techniques have been used by many intelligent software agents to retrieve, filter, and categorize documents available on the World Wide Web. Clustering is also useful for extracting salient features of related Web documents to automatically formulate queries and search for other similar documents on the Web. Traditional clustering algorithms either use a priori knowledge of document structures to define a distance or similarity among these documents, or use probabilistic techniques such as Bayesian classification. Many of these traditional algorithms, however, falter when the dimensionality of the feature space becomes high relative to the size of the document space. In this paper, we introduce two new clustering algorithms that can effectively cluster documents, even in the presence of a very high-dimensional feature space. These clustering techniques, which are based on generalizations of graph partitioning, do not require pre-specified ad hoc distance functions, and are capable of automatically discovering document similarities or associations. We conduct several experiments on real Web data using various feature selection heuristics, and compare our clustering schemes to standard distance-based techniques, such as hierarchical agglomerative clustering, and Bayesian classification methods, such as AutoClass.
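One divisive approach in the spirit of this line of work splits a document collection by the sign of each document's projection onto the leading principal direction of the centered term matrix. A pure-Python sketch using power iteration; the toy data, the iteration count, and the choice of this particular splitting rule are illustrative assumptions, not the paper's exact algorithms:

```python
def principal_direction(rows, iters=100):
    """Leading principal direction of the centered data, found by power
    iteration applied as C^T (C v) so the covariance matrix is never
    formed explicitly."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    C = [[r[j] - mean[j] for j in range(d)] for r in rows]
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iters):
        Cv = [sum(ci[j] * v[j] for j in range(d)) for ci in C]
        w = [sum(C[i][j] * Cv[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return mean, v

def split(rows):
    """Bisect the collection by the sign of each row's projection."""
    mean, v = principal_direction(rows)
    d = len(v)
    return [sum((r[j] - mean[j]) * v[j] for j in range(d)) >= 0 for r in rows]

# Four toy term vectors: two about one topic, two about another.
docs = [[3, 0, 0], [2, 1, 0], [0, 0, 3], [0, 1, 2]]
print(split(docs))  # groups docs 0,1 on one side and docs 2,3 on the other
```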


adaptive agents and multi-agents systems | 1998

WebACE: a Web agent for document categorization and exploration

Eui-Hong Han; Daniel Boley; Maria L. Gini; Robert A. Gross; Kyle Hastings; George Karypis; Vipin Kumar; Bamshad Mobasher; Jerome Moore

We propose an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an automatic categorization of a set of documents, combined with a process for generating new queries used to search for new related documents and filtering the resulting documents to extract the set of documents most closely related to the starting set. The document categories are not given a priori. The resulting document set could also be used to update the initial set of documents. We present the overall architecture and describe two novel algorithms which provide significant improvement over traditional clustering algorithms and form the basis for the query generation and search component of the agent.
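The query-generation idea can be sketched by pulling the most frequent terms out of a document cluster. This is a simplification; the unweighted frequency count here is an assumption, and the actual agent's term-selection scheme is more refined:

```python
from collections import Counter

def generate_query(cluster_docs, k=3):
    """Build a search query from the k most frequent terms across a
    cluster of tokenized documents (a simplified stand-in for the
    agent's query-generation component)."""
    counts = Counter(word for doc in cluster_docs for word in doc)
    return [word for word, _ in counts.most_common(k)]

cluster = [["data", "mining", "cluster"],
           ["data", "cluster", "web"],
           ["data", "mining", "web"]]
print(generate_query(cluster))  # 'data' is the top term
```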


Artificial Intelligence Review | 1999

Document Categorization and Query Generation on the World Wide Web Using WebACE

Daniel Boley; Maria L. Gini; Robert A. Gross; Eui-Hong Han; Kyle Hastings; George Karypis; Vipin Kumar; Bamshad Mobasher; Jerome Moore

We present WebACE, an agent for exploring and categorizing documents on the World Wide Web based on a user profile. The heart of the agent is an unsupervised categorization of a set of documents, combined with a process for generating new queries that is used to search for new related documents and for filtering the resulting documents to extract the ones most closely related to the starting set. The document categories are not given a priori. We present the overall architecture and describe two novel algorithms which provide significant improvement over Hierarchical Agglomeration Clustering and AutoClass algorithms and form the basis for the query generation and search component of the agent. We report on the results of our experiments comparing these new algorithms with more traditional clustering algorithms, and we show that our algorithms are fast and scalable.


conference on information and knowledge management | 2000

Fast supervised dimensionality reduction algorithm with applications to document categorization & retrieval

George Karypis; Eui-Hong Han

Retrieval techniques based on dimensionality reduction, such as Latent Semantic Indexing (LSI), have been shown to improve the quality of the information being retrieved by capturing the latent meaning of the words present in the documents. Unfortunately, the high computational and memory requirements of LSI and its inability to compute an effective dimensionality reduction in a supervised setting limit its applicability. In this paper we present a fast supervised dimensionality reduction algorithm that is derived from the recently developed cluster-based unsupervised dimensionality reduction algorithms. We experimentally evaluate the quality of the lower-dimensional spaces both in the context of document categorization and in terms of improvements in retrieval performance on a variety of different document collections. Our experiments show that the lower-dimensional spaces computed by our algorithm consistently improve the performance of traditional algorithms such as C4.5, k-nearest neighbor, and Support Vector Machines (SVM), by an average of 2% to 7%. Furthermore, the supervised lower-dimensional space greatly improves retrieval performance when compared to LSI.
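A cluster-based supervised reduction in this spirit represents each document by its similarity to each class centroid, mapping a high-dimensional term space down to one dimension per class. A minimal sketch under that assumption; the paper's exact construction is not reproduced here:

```python
import math

def centroid(docs):
    """Mean term vector of a list of documents."""
    d = len(docs[0])
    return [sum(doc[j] for doc in docs) / len(docs) for j in range(d)]

def cosine(x, y):
    num = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return num / (nx * ny) if nx and ny else 0.0

def project(doc, centroids):
    """Reduce a term vector to one similarity value per class centroid."""
    return [cosine(doc, c) for c in centroids]

classes = {"sports":   [[2, 1, 0, 0], [1, 2, 0, 0]],
           "politics": [[0, 0, 2, 1], [0, 0, 1, 2]]}
cents = [centroid(docs) for docs in classes.values()]
print(project([1, 1, 0, 0], cents))  # high similarity to the sports centroid only
```

Downstream classifiers (C4.5, k-NN, SVM) would then be trained on these low-dimensional vectors instead of the raw term vectors.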


Data Mining and Knowledge Discovery | 1999

Parallel Formulations of Decision-Tree Classification Algorithms

Anurag Srivastava; Eui-Hong Han; Vipin Kumar; Vineet Singh

Classification decision tree algorithms are used extensively for data mining in many domains, such as retail target marketing and fraud detection. Highly parallel algorithms for constructing classification decision trees are desirable for dealing with large data sets in a reasonable amount of time. Algorithms for building classification decision trees have a natural concurrency, but are difficult to parallelize due to the inherent dynamic nature of the computation. In this paper, we present parallel formulations of a classification decision tree learning algorithm based on induction. We describe two basic parallel formulations: one based on the Synchronous Tree Construction approach and the other on the Partitioned Tree Construction approach. We discuss the advantages and disadvantages of these methods and propose a hybrid method that combines their good features. We also provide an analysis of the computation and communication costs of the proposed hybrid method. Moreover, experimental results on an IBM SP-2 demonstrate excellent speedups and scalability.
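The per-node work that these parallel formulations distribute across processors is the evaluation of candidate splits. A serial sketch of that kernel using information gain; the specific split criterion and toy data are illustrative assumptions:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_gain(values, labels, threshold):
    """Information gain of splitting the records on value < threshold."""
    left = [l for v, l in zip(values, labels) if v < threshold]
    right = [l for v, l in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    return (entropy(labels)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))

ages = [22, 25, 47, 52]
labels = ["no", "no", "yes", "yes"]
print(split_gain(ages, labels, threshold=35))  # 1.0 -- a perfect split
```

In the Synchronous approach, processors evaluate these gains over horizontal partitions of the records and exchange the aggregated counts; in the Partitioned approach, whole subtrees are assigned to different processors.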


conference on information and knowledge management | 2005

Feature-based recommendation system

Eui-Hong Han; George Karypis

The explosive growth of the World Wide Web and the emergence of e-commerce have led to the development of recommender systems, a personalized information filtering technology used to identify a set of N items that will be of interest to a given user. User-based and model-based collaborative filtering are the most successful technologies for building recommender systems to date and are extensively used in many commercial recommender systems. The basic assumption in these algorithms is that there are sufficient historical data for measuring similarity between products or users. However, this assumption does not hold in application domains such as electronics retail, home shopping networks, and on-line retail, where new products are introduced and existing products disappear from the catalog. Another such application domain is the home improvement retail industry, where many products (such as window treatments, bathrooms, kitchens, or decks) are custom made. Each product is unique, and there are very few duplicate products. In this domain, the probability of the exact same two products being bought together is close to zero. In this paper, we discuss the challenges of providing recommendations in domains where no sufficient historical data exist for measuring similarity between products or users. We present feature-based recommendation algorithms that overcome the limitations of existing top-N recommendation algorithms. The experimental evaluation of the proposed algorithms on real-life data sets shows great promise. A pilot project deploying the proposed feature-based recommendation algorithms on an on-line retail web site showed a 75% increase in recommendation revenue over the first two-month period.
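The underlying idea of matching on product features rather than purchase co-occurrence can be sketched as follows. The scoring rule here, best Jaccard overlap with any purchased product, and the product names are illustrative assumptions, not the paper's exact algorithm:

```python
def jaccard(a, b):
    """Overlap between two feature sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def recommend(history, catalog, n=2):
    """Rank candidate products by their best feature overlap with any
    product the user already bought, and return the top n."""
    scored = sorted(catalog,
                    key=lambda pid: max(jaccard(catalog[pid], h) for h in history),
                    reverse=True)
    return scored[:n]

history = [{"wood", "blinds", "white"}]           # a past custom order
catalog = {"walnut-blinds": {"wood", "blinds", "brown"},
           "curtain-rod":   {"metal", "rod"},
           "vinyl-blinds":  {"white", "blinds", "vinyl"}}
print(recommend(history, catalog))  # both blinds products outrank the rod
```

Because the comparison is between feature sets rather than product identities, the scheme still works when every product in the catalog is unique.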


international conference on data mining | 2010

Content-Based Methods for Predicting Web-Site Demographic Attributes

Santosh Kabbur; Eui-Hong Han; George Karypis

Demographic information plays an important role in gaining valuable insights about a web-site's user base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using information derived from their content and their hyperlinked structure, without relying on any information directly or indirectly obtained from the web-sites' users. Such methods are important because users are becoming increasingly concerned about sharing their personal and behavioral information on the Internet. Regression-based approaches are developed and studied for predicting demographic attributes that utilize different content-derived features, different ways of building the prediction models, and different ways of aggregating web-page-level predictions that take into account the web's hyperlinked structure. In addition, a matrix-approximation-based approach is developed for coupling the predictions of individual regression models into a model designed to predict the probability mass function of the attribute. Extensive experiments show that these methods are able to achieve an RMSE of 8-10% and provide insights on how to best train and apply such models.
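As a simple stand-in for the coupling step, independent per-bucket regression outputs can be turned into a valid probability mass function by clipping and renormalizing. The paper's matrix-approximation approach is more involved; this sketch only illustrates the constraint being enforced:

```python
def to_pmf(bucket_preds):
    """Make independent per-bucket predictions (e.g. fraction of visitors
    in each age range) into a valid PMF: clip negatives, renormalize."""
    clipped = [max(p, 0.0) for p in bucket_preds]
    total = sum(clipped)
    if total == 0:
        return [1.0 / len(clipped)] * len(clipped)  # fall back to uniform
    return [p / total for p in clipped]

# Raw regression outputs for four age buckets; one is slightly negative.
pmf = to_pmf([0.5, 0.4, -0.1, 0.3])
print(pmf)  # sums to 1.0, with the negative bucket clipped to 0
```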

Collaboration


Dive into Eui-Hong Han's collaborations.

Top Co-Authors

Vipin Kumar

University of Minnesota


Daniel Boley

University of Minnesota


Jerome Moore

University of Minnesota
