David A. Cieslak
University of Notre Dame
Publications
Featured research published by David A. Cieslak.
Data Mining and Knowledge Discovery | 2008
Nitesh V. Chawla; David A. Cieslak; Lawrence O. Hall; Ajay Joshi
Learning from imbalanced data sets presents a convoluted problem both from the modeling and cost standpoints. In particular, when a class is of great interest but occurs relatively rarely such as in cases of fraud, instances of disease, and regions of interest in large-scale simulations, there is a correspondingly high cost for the misclassification of rare events. Under such circumstances, the data set is often re-sampled to generate models with high minority class accuracy. However, the sampling methods face a common, but important, criticism: how to automatically discover the proper amount and type of sampling? To address this problem, we propose a wrapper paradigm that discovers the amount of re-sampling for a data set based on optimizing evaluation functions like the f-measure, Area Under the ROC Curve (AUROC), cost, cost-curves, and the cost dependent f-measure. Our analysis of the wrapper is twofold. First, we report the interaction between different evaluation and wrapper optimization functions. Second, we present a set of results in a cost-sensitive environment, including scenarios of unknown or changing cost matrices. We also compared the performance of the wrapper approach versus cost-sensitive learning methods—MetaCost and the Cost-Sensitive Classifiers—and found the wrapper to outperform the cost-sensitive classifiers in a cost-sensitive environment. Lastly, we obtained the lowest cost per test example compared to any result we are aware of for the KDD-99 Cup intrusion detection data set.
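The wrapper idea lends itself to a compact sketch: try candidate sampling amounts, score each by cross-validated AUROC, and keep the best. The sketch below uses replication-based oversampling, a decision tree, and an illustrative rate grid; all of these are assumptions for illustration rather than the paper's exact search procedure.

```python
# A minimal sketch of the wrapper: search candidate minority oversampling
# rates and keep the rate maximizing cross-validated AUROC. The rate grid,
# classifier, and replication-based sampler are illustrative assumptions.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def oversample(X, y, rate, minority=1, rng=None):
    """Replicate minority-class rows until their count grows by `rate`x."""
    if rng is None:
        rng = np.random.default_rng(0)
    idx = np.flatnonzero(y == minority)
    extra = rng.choice(idx, size=int(len(idx) * (rate - 1.0)), replace=True)
    return np.vstack([X, X[extra]]), np.concatenate([y, y[extra]])

def wrapper_select_rate(X, y, rates=(1.0, 1.5, 2.0, 3.0, 5.0)):
    best_rate, best_auc = None, -1.0
    for rate in rates:
        aucs = []
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
        for tr, te in cv.split(X, y):
            Xs, ys = oversample(X[tr], y[tr], rate)
            clf = DecisionTreeClassifier(random_state=0).fit(Xs, ys)
            aucs.append(roc_auc_score(y[te], clf.predict_proba(X[te])[:, 1]))
        if np.mean(aucs) > best_auc:
            best_rate, best_auc = rate, np.mean(aucs)
    return best_rate, best_auc
```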
european conference on machine learning | 2008
David A. Cieslak; Nitesh V. Chawla
Learning from unbalanced datasets presents a convoluted problem in which traditional learning algorithms may perform poorly. The objective functions used for learning the classifiers typically tend to favor the larger, less important classes in such problems. This paper compares the performance of several popular decision tree splitting criteria --- information gain, Gini measure, and DKM --- and identifies Hellinger distance as a new skew-insensitive measure. We outline the strengths of Hellinger distance under class imbalance, propose its application in forming decision trees, and perform a comprehensive comparative analysis of each decision tree construction method. In addition, we consider the performance of each tree within a powerful sampling wrapper framework to capture the interaction of the splitting metric and sampling. We evaluate across a wide range of datasets and determine which methods operate best under class imbalance.
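For intuition, the binary-class form of the criterion can be written down directly. For a candidate split with branches $j = 1, \dots, p$, the score is the Hellinger distance between the positive and negative classes' branch distributions (notation paraphrased from the paper's formulation):

\[
d_H = \sqrt{\sum_{j=1}^{p} \left( \sqrt{\frac{|X_{+,j}|}{|X_+|}} - \sqrt{\frac{|X_{-,j}|}{|X_-|}} \right)^{2}}
\]

where $|X_+|$ and $|X_-|$ are the class totals and $|X_{+,j}|$, $|X_{-,j}|$ are the counts falling into branch $j$. Because each class is normalized by its own total, the score depends only on the within-class distributions and not the class priors, which is the source of its skew insensitivity.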
Data Mining and Knowledge Discovery | 2012
David A. Cieslak; T. Ryan Hoens; Nitesh V. Chawla; W. Philip Kegelmeyer
Learning from imbalanced data is an important and common problem. Decision trees, supplemented with sampling techniques, have proven to be an effective way to address the imbalanced data problem. Despite their effectiveness, however, sampling methods add complexity and the need for parameter selection. To bypass these difficulties we propose a new decision tree technique called Hellinger Distance Decision Trees (HDDT) which uses Hellinger distance as the splitting criterion. We analytically and empirically demonstrate the strong skew insensitivity of Hellinger distance and its advantages over popular alternatives such as entropy (gain ratio). We apply a comprehensive empirical evaluation framework testing against commonly used sampling and ensemble methods, considering performance across 58 varied datasets. We demonstrate the superiority (using robust tests of statistical significance) of HDDT on imbalanced data, as well as its competitive performance on balanced datasets. We thereby arrive at the particularly practical conclusion that for imbalanced data it is sufficient to use Hellinger trees with bagging (BG) without any sampling methods. We provide all the datasets and software for this paper online (http://www.nd.edu/~dial/hddt).
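A minimal sketch of the splitting step at the heart of HDDT, assuming binary labels and a single numeric feature: score each candidate threshold by the Hellinger distance between the classes' branch distributions and keep the best. Tree growing, stopping criteria, and bagging are omitted.

```python
# Score candidate thresholds on one numeric feature by Hellinger distance
# between the two classes' branch distributions; return the best threshold.
import numpy as np

def hellinger(y_left, y_right, y):
    pos, neg = np.sum(y == 1), np.sum(y == 0)
    total = 0.0
    for branch in (y_left, y_right):
        p = np.sum(branch == 1) / pos   # fraction of all positives in branch
        n = np.sum(branch == 0) / neg   # fraction of all negatives in branch
        total += (np.sqrt(p) - np.sqrt(n)) ** 2
    return np.sqrt(total)

def best_split(x, y):
    """Return (threshold, distance) maximizing the Hellinger distance."""
    best_t, best_d = None, -1.0
    for t in np.unique(x)[:-1]:         # exclude max so both branches stay nonempty
        d = hellinger(y[x <= t], y[x > t], y)
        if d > best_d:
            best_t, best_d = t, d
    return best_t, best_d
```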
granular computing | 2006
David A. Cieslak; Nitesh V. Chawla; Aaron Striegel
An approach to combating network intrusion is the development of systems applying machine learning and data mining techniques. Many IDSs (Intrusion Detection Systems) suffer from a high rate of false alarms and missed intrusions. We want to be able to improve the intrusion detection rate at a reduced false positive rate. The focus of this paper is rule-learning, using RIPPER, on highly imbalanced intrusion datasets with the objective of improving the true positive rate (intrusions) without significantly increasing the false positives. We use RIPPER as the underlying rule classifier. To counter imbalance in the data, we implement a combination of oversampling (both by replication and synthetic generation) and undersampling techniques. We also propose a clustering-based methodology for oversampling by generating synthetic instances. We evaluate our approaches on two intrusion datasets — destination-based and actual-packet-based — constructed from actual Notre Dame traffic, giving a flavor of real-world data with its idiosyncrasies. Using ROC analysis, we show that oversampling by synthetic generation of the minority (intrusion) class outperforms both oversampling by replication and RIPPER's loss ratio method. Additionally, we establish that our clustering-based approach is more suitable for detecting intrusions and provides additional improvement over synthetic generation of instances alone.
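The clustering-based oversampling can be sketched as follows: cluster the minority (intrusion) class, then synthesize new instances by interpolating between points that share a cluster, so generated samples respect local structure. The cluster count and interpolation scheme here are illustrative assumptions, not the paper's exact settings.

```python
# Cluster the minority class, then create synthetic instances by
# interpolating between two points drawn from the same cluster.
import numpy as np
from sklearn.cluster import KMeans

def cluster_oversample(X_min, n_new, n_clusters=5, seed=0):
    rng = np.random.default_rng(seed)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X_min)
    synthetic = []
    for _ in range(n_new):
        c = rng.integers(n_clusters)                  # pick a cluster
        members = X_min[labels == c]
        if len(members) < 2:                          # skip degenerate clusters
            continue
        a, b = members[rng.choice(len(members), 2, replace=False)]
        synthetic.append(a + rng.random() * (b - a))  # interpolate along the segment
    return np.array(synthetic)
```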
ACM Transactions on Information and System Security | 2008
Chad D. Mano; Andrew Blaich; Qi Liao; Yingxin Jiang; David A. Cieslak; David Salyers; Aaron Striegel
Wireless network access has become an integral part of computing both at home and at the workplace. The convenience of wireless network access at work may be extremely beneficial to employees, but can be a burden to network security personnel. This burden is magnified by the threat of inexpensive wireless access points being installed in a network without the knowledge of network administrators. These devices, termed Rogue Wireless Access Points, may allow a malicious outsider to access valuable network resources, including confidential communication and other stored data. For this reason, wireless connectivity detection is an essential capability, but remains a difficult problem. We present a method of detecting wireless hosts using a local RTT metric and a novel packet payload slicing technique. The local RTT metric provides the means to identify physical transmission media while packet payload slicing conditions network traffic to enhance the accuracy of the detections. Most importantly, the packet payload slicing method is transparent to both clients and servers and does not require direct communication between the monitoring system and monitored hosts.
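Once local RTT samples are available, the detection signal reduces to a simple decision: wireless last hops tend to show higher and more variable RTTs than wired ones. A heavily simplified sketch of that final classification step, with a placeholder threshold that is an assumption rather than a value from the paper (which also conditions traffic via payload slicing first):

```python
# Flag a host as likely wireless when its median local RTT (seconds),
# measured at the monitoring point, exceeds a threshold. The 2 ms value
# is a placeholder assumption, not a figure from the paper.
import statistics

def looks_wireless(local_rtts, threshold_s=0.002):
    return statistics.median(local_rtts) > threshold_s

# e.g. looks_wireless([0.0004, 0.0005, 0.0004]) -> False (likely wired)
```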
Knowledge and Information Systems | 2009
David A. Cieslak; Nitesh V. Chawla
Classifier error is the product of model bias and data variance. While it is important to understand the bias a given learning algorithm introduces, it is similarly important to understand the variability in data over time, since even the One True Model might perform poorly when training and evaluation samples diverge. The ability to identify distributional divergence is thus critical to pinpointing when fracture points in classifier performance will occur, particularly since contemporary methods such as tenfold cross-validation and hold-out evaluation are poor predictors in divergent circumstances. This article implements a comprehensive evaluation framework to proactively detect breakpoints in classifiers' predictions and shifts in data distributions through a series of statistical tests. We outline and utilize three scenarios under which data changes: sample selection bias, covariate shift, and shifting class priors. We evaluate the framework with a variety of classifiers and datasets.
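One concrete instance of such a statistical test is a per-feature two-sample Kolmogorov-Smirnov comparison between the training sample and newly arriving data; the paper applies a series of tests, so treat this particular choice and the significance level as assumptions:

```python
# Flag covariate shift by running a two-sample KS test on each feature
# and reporting the features whose distributions appear to diverge.
import numpy as np
from scipy.stats import ks_2samp

def detect_shift(X_train, X_new, alpha=0.01):
    """Return indices of features whose train/new distributions diverge."""
    shifted = []
    for j in range(X_train.shape[1]):
        stat, p = ks_2samp(X_train[:, j], X_new[:, j])
        if p < alpha:
            shifted.append(j)
    return shifted
```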
international conference on data mining | 2008
David A. Cieslak; Nitesh V. Chawla
Class imbalance is a ubiquitous problem in supervised learning and has gained wide-scale attention in the literature. Perhaps the most prevalent solution is to apply sampling to training data in order to improve classifier performance. The typical approach will apply uniform levels of sampling globally. However, we believe that data is typically multi-modal, which suggests sampling should be treated locally rather than globally. It is the purpose of this paper to propose a framework which first identifies meaningful regions of data and then proceeds to find optimal sampling levels within each. This paper demonstrates that a global classifier trained on data locally sampled produces superior rank-orderings on a wide range of real-world and artificial datasets as compared to contemporary global sampling methods.
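The local-sampling idea can be sketched by standing in k-means for the paper's region finder and applying a per-region oversampling target; in the paper the level within each region would itself be optimized (e.g., wrapper-style), so the fixed ratio below is a simplification:

```python
# Partition the data into regions, then replicate minority examples
# within each region toward a per-region target class ratio.
import numpy as np
from sklearn.cluster import KMeans

def locally_oversample(X, y, n_regions=4, target_ratio=0.5, seed=0):
    rng = np.random.default_rng(seed)
    regions = KMeans(n_clusters=n_regions, n_init=10,
                     random_state=seed).fit_predict(X)
    Xs, ys = [X], [y]
    for r in range(n_regions):
        mask = regions == r
        pos = np.flatnonzero(mask & (y == 1))
        neg = np.flatnonzero(mask & (y == 0))
        need = int(target_ratio * len(neg)) - len(pos)
        if need > 0 and len(pos) > 0:          # replicate minority locally
            extra = rng.choice(pos, size=need, replace=True)
            Xs.append(X[extra])
            ys.append(y[extra])
    return np.vstack(Xs), np.concatenate(ys)
```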
grid computing | 2008
David A. Cieslak; Nitesh V. Chawla; Douglas Thain
Large scale production computing grids introduce new challenges in debugging and troubleshooting. A user that submits a workload consisting of tens of thousands of jobs to a grid of thousands of processors has a good chance of receiving thousands of error messages as a result. How can one begin to reason about such problems? We propose that data mining techniques can be employed to classify failures according to the properties of the jobs and machines involved. We demonstrate this technique through several case studies on real workloads consisting of tens of thousands of jobs. We apply the same techniques to a year's worth of data on a 3000 CPU production grid and use it to gain a high level understanding of the system behavior.
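The classification step admits a small sketch: fit a decision tree on job and machine properties labeled with success or failure, then print its rules, which is one plausible way to surface failure-correlated properties. The feature names and toy data below are hypothetical, not drawn from the paper's workloads:

```python
# Train a decision tree on integer-coded job/machine properties labeled
# with failure (1) or success (0), then print the learned rules so an
# operator can read off failure-correlated properties.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

features = ["os_version", "cluster_id", "hour_of_day", "owner_id"]  # hypothetical
X = np.array([[2, 0, 3, 1], [2, 1, 2, 0], [1, 0, 23, 1],
              [1, 1, 1, 2], [2, 0, 4, 1], [1, 1, 22, 0]])           # toy data
y = np.array([1, 0, 1, 0, 1, 1])          # 1 = job failed, 0 = succeeded

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(clf, feature_names=features))
```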
knowledge discovery and data mining | 2008
David A. Cieslak; Nitesh V. Chawla
Many machine learning applications like finance, medicine, and risk management suffer from class imbalance: cases of interest occur rarely. Further complicating these applications is that the training and testing samples might differ significantly in their respective class distributions. Sampling has been shown to be a strong solution to imbalance and additionally offers a rich parameter space from which to select classifiers. This paper is concerned with the interaction between Probability Estimation Trees (PETs) [1], sampling, and performance metrics as testing distributions fluctuate substantially. A set of comprehensive analyses is presented, which anticipate classifier performance through a set of widely varying testing distributions.
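A common reading of a PET, and an assumption here rather than the paper's exact variant, is an unpruned decision tree whose leaf probabilities are Laplace-smoothed so that sparse leaves do not emit extreme 0/1 estimates, a property that matters when test distributions drift:

```python
# An unpruned tree with Laplace-smoothed leaf probabilities: estimate
# P(y=1) at each leaf as (positives + 1) / (count + 2).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class LaplacePET:
    def fit(self, X, y):
        self.tree = DecisionTreeClassifier(random_state=0).fit(X, y)
        leaves = self.tree.apply(X)            # leaf id per training row
        self.probs = {}                        # leaf id -> smoothed P(y=1)
        for leaf in np.unique(leaves):
            ys = y[leaves == leaf]
            self.probs[leaf] = (np.sum(ys == 1) + 1) / (len(ys) + 2)
        return self

    def predict_proba_pos(self, X):
        return np.array([self.probs[leaf] for leaf in self.tree.apply(X)])
```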
high performance distributed computing | 2006
David A. Cieslak; Douglas Thain; Nitesh V. Chawla
Through massive parallelism, distributed systems enable the multiplication of productivity. Unfortunately, increasing the scale of machines available to users will also multiply debugging when failure occurs. Data mining allows the extraction of patterns within large amounts of data and therefore forms the foundation for a useful method of debugging, particularly within such distributed systems. This paper outlines a successful application of data mining in troubleshooting distributed systems, proposes a framework for further study, and speculates on other future work. We propose that data mining techniques can be applied to the problem of large scale troubleshooting. If both jobs and the resources that they consume are annotated with structured information relevant to success or failure, then classification algorithms can be used to find properties of each that correlate with success or failure. In the one-million jobs example above, an ideal troubleshooter would report to the user something like: your jobs always fail on Linux 2.8 machines, always fail on cluster X between midnight and 6 A.M., and fail with 50% probability on machines owned by user Y. Further, these discoveries may be used to automatically avoid making bad placement decisions that waste time and resources. We hasten to note that this form of data mining is not a panacea. It does not explain why failures happen, or make any attempt to diagnose problems in fine detail. It only proposes to the user properties correlated with success or failure; other tools and techniques may be applied to extract causes. Rather, data mining allows the user of a large system to rapidly make generalizations to improve the throughput and reliability of a system without engaging in low-level debugging. These generalizations may be used later at leisure to locate and repair problems. In addition, the problem of distributed debugging, with its unique idiosyncrasies and dynamics, lends itself as a compelling application for data mining research. Standard off-the-shelf methodologies might not be directly applicable to a large, dynamic, and evolving system. It is desirable to implement techniques that are capable of incremental self-revision and adaptation. The goal of our paper is to serve as a proof-of-concept and identify avenues for compelling future research.