Pang Ning Tan | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Pang Ning Tan is active.

Explore More

Publication

Featured researches published by Pang Ning Tan.

Sigkdd Explorations | 2000

Web usage mining: discovery and applications of usage patterns from Web data

Jaideep Srivastava; Robert Cooley; Mukund Deshpande; Pang Ning Tan

Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.

knowledge discovery and data mining | 2002

Selecting the right interestingness measure for association patterns

Pang Ning Tan; Vipin Kumar; Jaideep Srivastava

Many techniques for association rule mining and feature selection require a suitable metric to capture the dependencies among variables in a data set. For example, metrics such as support, confidence, lift, correlation, and collective strength are often used to determine the interestingness of association patterns. However, many such measures provide conflicting information about the interestingness of a pattern, and the best metric to use for a given application domain is rarely known. In this paper, we present an overview of various measures proposed in the statistics, machine learning and data mining literature. We describe several key properties one should examine in order to select the right measure for a given application domain. A comparative study of these properties is made using twenty one of the existing measures. We show that each measure has different properties which make them useful for some application domains, but not for others. We also present two scenarios in which most of the existing measures agree with each other, namely, support-based pruning and table standardization. Finally, we present an algorithm to select a small set of tables such that an expert can select a desirable measure by looking at just this small set of tables.

knowledge discovery and data mining | 2004

Selecting the right objective measure for association analysis

Pang Ning Tan; Vipin Kumar; Jaideep Srivastava

Objective measures such as support, confidence, interest factor, correlation, and entropy are often used to evaluate the interestingness of association patterns. However, in many situations, these measures may provide conflicting information about the interestingness of a pattern. Data mining practitioners also tend to apply an objective measure without realizing that there may be better alternatives available for their application. In this paper, we describe several key properties one should examine in order to select the right measure for a given application. A comparative study of these properties is made using twenty-one measures that were originally developed in diverse fields such as statistics, social science, machine learning, and data mining. We show that depending on its properties, each measure is useful for some application, but not for others. We also demonstrate two scenarios in which many existing measures become consistent with each other, namely, when support-based pruning and a technique known as table standardization are applied. Finally, we present an algorithm for selecting a small set of patterns such that domain experts can find a measure that best fits their requirements by ranking this small set of patterns.

Data Mining and Knowledge Discovery | 2002

Discovery of Web Robot Sessions Based on their Navigational Patterns

Pang Ning Tan; Vipin Kumar

Web robots are software programs that automatically traverse the hyperlink structure of the World Wide Web in order to locate and retrieve information. There are many reasons why it is important to identify visits by Web robots and distinguish them from other users. First of all, e-commerce retailers are particularly concerned about the unauthorized deployment of robots for gathering business intelligence at their Web sites. In addition, Web robots tend to consume considerable network bandwidth at the expense of other users. Sessions due to Web robots also make it more difficult to perform clickstream analysis effectively on the Web data. Conventional techniques for detecting Web robots are often based on identifying the IP address and user agent of the Web clients. While these techniques are applicable to many well-known robots, they may not be sufficient to detect camouflaged and previously unknown robots. In this paper, we propose an alternative approach that uses the navigational patterns in the click-stream data to determine if it is due to a robot. Experimental results on our Computer Science department Web server logs show that highly accurate classification models can be built using this approach. We also show that these models are able to discover many camouflaged and previously unidentified robots.

Data Mining and Knowledge Discovery | 2006

Hyperclique pattern discovery

Hui Xiong; Pang Ning Tan; Vipin Kumar

Existing algorithms for mining association patterns often rely on the support-based pruning strategy to prune a combinatorial search space. However, this strategy is not effective for discovering potentially interesting patterns at low levels of support. Also, it tends to generate too many spurious patterns involving items which are from different support levels and are poorly correlated. In this paper, we present a framework for mining highly-correlated association patterns called hyperclique patterns. In this framework, an objective measure called h-confidence is applied to discover hyperclique patterns. We prove that the items in a hyperclique pattern have a guaranteed level of global pairwise similarity to one another as measured by the cosine similarity (uncentered Pearsons correlation coefficient). Also, we show that the h-confidence measure satisfies a cross-support property which can help efficiently eliminate spurious patterns involving items with substantially different support levels. Indeed, this cross-support property is not limited to h-confidence and can be generalized to some other association measures. In addition, an algorithm called hyperclique miner is proposed to exploit both cross-support and anti-monotone properties of the h-confidence measure for the efficient discovery of hyperclique patterns. Finally, our experimental results show that hyperclique miner can efficiently identify hyperclique patterns, even at extremely low levels of support.

international conference on data mining | 2003

Mining strong affinity association patterns in data sets with skewed support distribution

Hui Xiong; Pang Ning Tan; Vipin Kumar

Existing association-rule mining algorithms often rely on the support-based pruning strategy to prune its combinatorial search space. This strategy is not quite effective for data sets with skewed support distributions because they tend to generate many spurious patterns involving items from different support levels or miss potentially interesting low-support patterns. To overcome these problems, we propose the concept of hyperclique pattern, which uses an objective measure called h-confidence to identify strong affinity patterns. We also introduce the novel concept of cross-support property for eliminating patterns involving items with substantially different support levels. Our experimental results demonstrate the effectiveness of this method for finding patterns in dense data sets even at very low support thresholds, where most of the existing algorithms would break down. Finally, hyperclique patterns also show great promise for clustering items in high dimensional space.

knowledge discovery and data mining | 1999

Discovery of Interesting Usage Patterns from Web Data

Robert Cooley; Pang Ning Tan; Jaideep Srivastava

Web Usage Mining is the application of data mining techniques to large Web data repositories in order to extract usage patterns. As with many data mining application domains, the identification of patterns that are considered interesting is a problem that must be solved in addition to simply generating them. Aneces sary step in identifying interesting results is quantifying what is considered uninteresting in order to form a basis for comparison. Several research efforts have relied on manually generated sets of uninteresting rules. However, manual generation of a comprehensive set of evidence about beliefs for a particular domain is impractical in many cases. Generally, domain knowledge can be used to automatically create evidence for or against a set of beliefs. This paper develops a quantitative model based on support logic for determining the interestingness of discovered patterns. For Web Usage Mining, there are three types of domain information available; usage, content, and structure. This paper also describes algorithms for using these three types of information to automatically identify interesting knowledge. These algorithms have been incorporated into the Web Site Information Filter (WebSIFT) system and examples of interesting frequent itemsets automatically discovered from real Web data are presented.

knowledge discovery and data mining | 2003

Discovery of climate indices using clustering

Michael Steinbach; Pang Ning Tan; Vipin Kumar; Steven A. Klooster; Christopher Potter

To analyze the effect of the oceans and atmosphere on land climate, Earth Scientists have developed climate indices, which are time series that summarize the behavior of selected regions of the Earths oceans and atmosphere. In the past, Earth scientists have used observation and, more recently, eigenvalue analysis techniques, such as principal components analysis (PCA) and singular value decomposition (SVD), to discover climate indices. However, eigenvalue techniques are only useful for finding a few of the strongest signals. Furthermore, they impose a condition that all discovered signals must be orthogonal to each other, making it difficult to attach a physical interpretation to them. This paper presents an alternative clustering-based methodology for the discovery of climate indices that overcomes these limitiations and is based on clusters that represent regions with relatively homogeneous behavior. The centroids of these clusters are time series that summarize the behavior of the ocean or atmosphere in those regions. Some of these centroids correspond to known climate indices and provide a validation of our methodology; other centroids are variants of known indices that may provide better predictive power for some land areas; and still other indices may represent potentially new Earth science phenomena. Finally, we show that cluster based indices generally outperform SVD derived indices, both in terms of area weighted correlation and direct correlation with the known indices.

knowledge discovery and data mining | 2004

Exploiting a support-based upper bound of Pearson's correlation coefficient for efficiently identifying strongly correlated pairs

Hui Xiong; Shashi Shekhar; Pang Ning Tan; Vipin Kumar

Given a user-specified minimum correlation threshold θ and a market basket database with N items and T transactions, an all-strong-pairs correlation query finds all item pairs with correlations above the threshold θ. However, when the number of items and transactions are large, the computation cost of this query can be very high. In this paper, we identify an upper bound of Pearsons correlation coefficient for binary variables. This upper bound is not only much cheaper to compute than Pearsons correlation coefficient but also exhibits a special monotone property which allows pruning of many item pairs even without computing their upper bounds. A Two-step All-strong-Pairs corrElation que Ry (TAPER) algorithm is proposed to exploit these properties in a filter-and-refine manner. Furthermore, we provide an algebraic cost model which shows that the computation savings from pruning is independent or improves when the number of items is increased in data sets with common Zipf or linear rank-support distributions. Experimental results from synthetic and real data sets exhibit similar trends and show that the TAPER algorithm can be an order of magnitude faster than brute-force alternatives.

european conference on principles of data mining and knowledge discovery | 2000

Indirect Association: Mining Higher Order Dependencies in Data

Pang Ning Tan; Vipin Kumar; Jaideep Srivastava

This paper introduces a novel pattern called indirect association and examines its utility in various application domains. Existing algorithms for mining associations, such as Apriori, will only discover itemsets that have support above a user-defined threshold. Any itemsets with support below the minimum support requirement are filtered out. We believe that an infrequent pair of items can be useful if the items are related indirectly via some other set of items. In this paper, we propose an algorithm for deriving indirectly associated itempairs and demonstrate the potential application of these patterns in the retail, textual and stock market domains.

Explore More