Publication


Featured research published by Salvatore J. Stolfo.


USENIX Security Symposium | 1998

Data mining approaches for intrusion detection

Wenke Lee; Salvatore J. Stolfo

In this paper we discuss our research in developing general and systematic methods for intrusion detection. The key ideas are to use data mining techniques to discover consistent and useful patterns of system features that describe program and user behavior, and use the set of relevant system features to compute (inductively learned) classifiers that can recognize anomalies and known intrusions. Using experiments on the sendmail system call data and the network tcpdump data, we demonstrate that we can construct concise and accurate classifiers to detect anomalies. We provide an overview of two general data mining algorithms that we have implemented: the association rules algorithm and the frequent episodes algorithm. These algorithms can be used to compute the intra- and inter-audit record patterns, which are essential in describing program or user behavior. The discovered patterns can guide the audit data gathering process and facilitate feature selection. To meet the challenges of both efficient learning (mining) and real-time detection, we propose an agent-based architecture for intrusion detection systems where the learning agents continuously compute and provide the updated (detection) models to the detection agents.
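
As a rough illustration of the association-rules step described above, the sketch below (not the authors' implementation) treats each audit record as a set of attribute-value items, counts co-occurring pairs, and keeps rules that meet assumed support and confidence thresholds.

```python
from collections import Counter
from itertools import combinations

def mine_rules(records, min_support=0.1, min_confidence=0.8):
    """records: list of dicts mapping an audit attribute name to its value."""
    n = len(records)
    item_counts, pair_counts = Counter(), Counter()
    for rec in records:
        items = sorted(rec.items())          # (attribute, value) pairs
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    rules = []
    for (a, b), count in pair_counts.items():
        if count / n < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, count / n, confidence))
    return rules  # (antecedent, consequent, support, confidence) tuples

# e.g. mine_rules([{"service": "smtp", "flag": "SF"}, {"service": "smtp", "flag": "SF"}])
```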


IEEE Symposium on Security and Privacy | 1999

A data mining framework for building intrusion detection models

Wenke Lee; Salvatore J. Stolfo; Kui W. Mok

There is often the need to update an installed intrusion detection system (IDS) due to new attack methods or upgraded computing environments. Since many current IDSs are constructed by manual encoding of expert knowledge, changes to IDSs are expensive and slow. We describe a data mining framework for adaptively building Intrusion Detection (ID) models. The central idea is to utilize auditing programs to extract an extensive set of features that describe each network connection or host session, and apply data mining programs to learn rules that accurately capture the behavior of intrusions and normal activities. These rules can then be used for misuse detection and anomaly detection. New detection models are incorporated into an existing IDS through a meta-learning (or co-operative learning) process, which produces a meta detection model that combines evidence from multiple models. We discuss the strengths of our data mining programs, namely, classification, meta-learning, association rules, and frequent episodes. We report on the results of applying these programs to the extensively gathered network audit data for the 1998 DARPA Intrusion Detection Evaluation Program.
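
The meta-learning step can be pictured as stacking: base detectors are trained separately, and a second-level model learns to combine their predictions. The sketch below is a generic illustration under that assumption; the scikit-learn learners here are stand-ins, not the system's actual components.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_meta(base_models, X_train, y_train, X_valid, y_valid):
    """Fit each base detector, then fit a meta-model on their validation predictions."""
    for model in base_models:
        model.fit(X_train, y_train)
    meta_features = np.column_stack([m.predict(X_valid) for m in base_models])
    return LogisticRegression().fit(meta_features, y_valid)

def predict_combined(base_models, meta_model, X):
    """Combine the base detectors' votes through the learned meta-model."""
    votes = np.column_stack([m.predict(X) for m in base_models])
    return meta_model.predict(votes)
```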


ACM Transactions on Information and System Security | 2000

A framework for constructing features and models for intrusion detection systems

Wenke Lee; Salvatore J. Stolfo

Intrusion detection (ID) is an important component of infrastructure protection mechanisms. Intrusion detection systems (IDSs) need to be accurate, adaptive, and extensible. Given these requirements and the complexities of today's network environments, we need a more systematic and automated IDS development process rather than the pure knowledge encoding and engineering approaches. This article describes a novel framework, MADAM ID, for Mining Audit Data for Automated Models for Intrusion Detection. This framework uses data mining algorithms to compute activity patterns from system audit data and extracts predictive features from the patterns. It then applies machine learning algorithms to the audit records that are processed according to the feature definitions to generate intrusion detection rules. Results from the 1998 DARPA Intrusion Detection Evaluation showed that our ID model was one of the best performing of all the participating systems. We also briefly discuss our experience in converting the detection models produced by off-line data mining programs to real-time modules of existing IDSs.


International Conference on Management of Data | 1995

The merge/purge problem for large databases

Mauricio A. Hernández; Salvatore J. Stolfo

Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results demonstrating that this approach may work well in practice, but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the transitive closure over the results of independent runs, considering alternative primary key attributes in each pass.
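
A minimal sketch of the sorted-neighborhood idea, assuming simple dictionary records and illustrative key and match functions: sort on a derived key, then compare only records inside a fixed-size sliding window.

```python
def sorted_neighborhood(records, key_fn, match_fn, window=10):
    """Sort on a derived key, then test for matches only inside a sliding window."""
    ordered = sorted(records, key=key_fn)
    pairs = []
    for i, rec in enumerate(ordered):
        for j in range(max(0, i - window + 1), i):
            if match_fn(ordered[j], rec):
                pairs.append((ordered[j], rec))
    return pairs

# Illustrative key and match functions for records like
# {"last": "Smith", "first": "Ann", "addr": "12 Main St"} (field names are assumptions).
key_fn = lambda r: (r["last"][:3] + r["first"][:3] + r["addr"][:3]).lower()
match_fn = lambda a, b: a["last"] == b["last"] and a["addr"] == b["addr"]
```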


Recent Advances in Intrusion Detection | 2004

Anomalous Payload-Based Network Intrusion Detection

Ke Wang; Salvatore J. Stolfo

We present a payload-based anomaly detector for intrusion detection that we call PAYL. PAYL models the normal application payload of network traffic in a fully automatic, unsupervised and very efficient fashion. We first compute during a training phase a profile byte frequency distribution and its standard deviation for the application payload flowing to a single host and port. We then use the Mahalanobis distance during the detection phase to calculate the similarity of new data against the pre-computed profile. The detector compares this measure against a threshold and generates an alert when the distance of the new input exceeds this threshold. We demonstrate the surprising effectiveness of the method on the 1999 DARPA IDS dataset and a live dataset we collected on the Columbia CS department network. In one case, nearly 100% accuracy is achieved with a 0.1% false positive rate for port 80 traffic.
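
A minimal sketch of the profiling and detection steps as described above, assuming a simplified Mahalanobis-style distance and an illustrative smoothing constant; this is not the authors' released code.

```python
import math

ALPHA = 0.001  # assumed smoothing constant to avoid division by zero

def byte_freq(payload: bytes):
    """Relative frequency of each of the 256 byte values in one payload."""
    counts = [0] * 256
    for b in payload:
        counts[b] += 1
    n = max(len(payload), 1)
    return [c / n for c in counts]

def train_profile(payloads):
    """Per-byte mean and standard deviation over the training payloads for one host/port."""
    freqs = [byte_freq(p) for p in payloads]
    mean = [sum(col) / len(freqs) for col in zip(*freqs)]
    std = [math.sqrt(sum((f - m) ** 2 for f in col) / len(freqs))
           for col, m in zip(zip(*freqs), mean)]
    return mean, std

def distance(payload: bytes, mean, std):
    """Simplified Mahalanobis-style distance of a new payload from the profile."""
    f = byte_freq(payload)
    return sum(abs(fi - mi) / (si + ALPHA) for fi, mi, si in zip(f, mean, std))

# Detection: alert when distance(new_payload, mean, std) exceeds a chosen threshold.
```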


Data Mining and Knowledge Discovery | 1998

Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem

Mauricio A. Hernández; Salvatore J. Stolfo

The problem of merging multiple databases of information about common entities is frequently encountered in KDD and decision support applications in large commercial and government organizations. The problem we study is often called the Merge/Purge problem and is difficult to solve both in scale and accuracy. Large repositories of data typically have numerous duplicate information entries about the same entities that are difficult to cull together without an intelligent “equational theory” that identifies equivalent items by a complex, domain-dependent matching process. We have developed a system for accomplishing this Data Cleansing task and demonstrate its use for cleansing lists of names of potential customers in a direct marketing-type application. Our results for statistically generated data are shown to be accurate and effective when processing the data multiple times using different keys for sorting on each successive pass. Combining the results of individual passes using transitive closure over the independent results produces far more accurate results at lower cost. The system provides a rule programming module that is easy to program and quite good at finding duplicates, especially in an environment with massive amounts of data. This paper details improvements in our system, and reports on the successful implementation for a real-world database that conclusively validates our results previously achieved for statistically generated data.
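
The multi-pass combination can be sketched with a union-find structure: pairs declared duplicates in any pass are merged, so transitivity (A matches B, B matches C) pulls A, B, and C into one cluster. Record identifiers are assumed to be hashable.

```python
class UnionFind:
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def merge_passes(pair_lists):
    """pair_lists: one list of matched (id_a, id_b) pairs per independent pass."""
    uf = UnionFind()
    for pairs in pair_lists:
        for a, b in pairs:
            uf.union(a, b)
    clusters = {}
    for x in uf.parent:
        clusters.setdefault(uf.find(x), set()).add(x)
    return list(clusters.values())  # each set is one cluster of duplicate records

# e.g. merge_passes([[("A", "B")], [("B", "C")]]) -> [{"A", "B", "C"}]
```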


IEEE Symposium on Security and Privacy | 2001

Data mining methods for detection of new malicious executables

Matthew G. Schultz; Eleazar Eskin; F. Zadok; Salvatore J. Stolfo

A serious security threat today is malicious executables, especially new, unseen malicious executables often arriving as email attachments. These new malicious executables are created at the rate of thousands every year and pose a serious security threat. Current anti-virus systems attempt to detect these new malicious programs with heuristics generated by hand. This approach is costly and oftentimes ineffective. We present a data mining framework that detects new, previously unseen malicious executables accurately and automatically. The data mining framework automatically found patterns in our data set and used these patterns to detect a set of new malicious binaries. Comparing our detection methods with a traditional signature-based method, our method more than doubles the current detection rates for new malicious executables.
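
The paper evaluates several feature and classifier combinations; the sketch below shows just one assumed pairing, printable strings extracted from the binary fed to a Naive Bayes classifier, using scikit-learn for brevity rather than the authors' own learners.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

def printable_strings(data: bytes, min_len=4):
    """Extract ASCII strings from a binary, the classic 'strings' feature."""
    found = re.findall(rb"[ -~]{%d,}" % min_len, data)
    return b" ".join(found).decode("ascii", "ignore")

def train(binaries, labels):
    """binaries: list of raw executable bytes; labels: 1 = malicious, 0 = benign."""
    vectorizer = CountVectorizer(binary=True)
    X = vectorizer.fit_transform(printable_strings(b) for b in binaries)
    classifier = BernoulliNB().fit(X, labels)
    return vectorizer, classifier

def predict(vectorizer, classifier, binary: bytes):
    return classifier.predict(vectorizer.transform([printable_strings(binary)]))[0]
```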


DARPA Information Survivability Conference and Exposition | 2000

Cost-based modeling for fraud and intrusion detection: results from the JAM project

Salvatore J. Stolfo; Wei Fan; Wenke Lee; Andreas L. Prodromidis; Philip K. Chan

We describe the results achieved using the JAM distributed data mining system for the real-world problem of fraud detection in financial information systems. For this domain we provide clear evidence that state-of-the-art commercial fraud detection systems can be substantially improved in stopping losses due to fraud by combining multiple models of fraudulent transactions shared among banks. We demonstrate that the traditional statistical metrics used to train and evaluate the performance of learning systems (i.e., statistical accuracy or ROC analysis) are misleading and perhaps inappropriate for this application. Cost-based metrics are more relevant in certain domains, and defining such metrics poses significant and interesting research questions both in evaluating systems and alternative models, and in formalizing the problems to which one may wish to apply data mining technologies. This paper also demonstrates how the techniques developed for fraud detection can be generalized and applied to the important area of intrusion detection in networked information systems. We report the outcome of recent evaluations of our system applied to tcpdump network intrusion data specifically with respect to statistical accuracy. This work involved building additional components of JAM that we have come to call MADAM ID (Mining Audit Data for Automated Models for Intrusion Detection). However, taking the next step to define cost-based models for intrusion detection poses interesting new research questions. We describe our initial ideas about how to evaluate intrusion detection systems using cost models learned during our work on fraud detection.
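
A minimal sketch of a cost-based metric in the spirit described above: the detector is scored by the loss it leaves behind, charging each missed fraud its transaction amount and each flagged transaction a fixed investigation overhead. The overhead value is an assumption, not the paper's cost model.

```python
def total_loss(transactions, challenge_cost=10.0):
    """Sum of missed-fraud amounts plus a fixed overhead per flagged transaction.

    transactions: iterable of (amount, is_fraud, flagged) tuples.
    challenge_cost: assumed fixed cost of investigating one flagged transaction.
    """
    loss = 0.0
    for amount, is_fraud, flagged in transactions:
        if flagged:
            loss += challenge_cost   # every alert costs analyst time
        elif is_fraud:
            loss += amount           # an undetected fraud costs its full amount
    return loss

# Two detectors with identical accuracy can have very different total_loss values,
# e.g. if one misses only small-amount frauds and the other misses large ones.
```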


Artificial Intelligence Review | 2000

Adaptive Intrusion Detection: A Data Mining Approach

Wenke Lee; Salvatore J. Stolfo; Kui W. Mok

In this paper we describe a data mining framework for constructing intrusion detection models. The first key idea is to mine system audit data for consistent and useful patterns of program and user behavior. The other is to use the set of relevant system features presented in the patterns to compute inductively learned classifiers that can recognize anomalies and known intrusions. In order for the classifiers to be effective intrusion detection models, we need to have sufficient audit data for training and also select a set of predictive system features. We propose to use the association rules and frequent episodes computed from audit data as the basis for guiding the audit data gathering and feature selection processes. We modify these two basic algorithms to use axis attribute(s) and reference attribute(s) as forms of item constraints to compute only the relevant patterns. In addition, we use an iterative level-wise approximate mining procedure to uncover the low frequency but important patterns. We use meta-learning as a mechanism to make intrusion detection models more effective and adaptive. We report our extensive experiments in using our framework on real-world audit data.


Recent Advances in Intrusion Detection | 2006

Anagram: a content anomaly detector resistant to mimicry attack

Ke Wang; Janak J. Parekh; Salvatore J. Stolfo

In this paper, we present Anagram, a content anomaly detector that models a mixture of high-order n-grams (n > 1) designed to detect anomalous and “suspicious” network packet payloads. By using higher-order n-grams, Anagram can detect significant anomalous byte sequences and generate robust signatures of validated malicious packet content. The Anagram content models are implemented using highly efficient Bloom filters, reducing space requirements and enabling privacy-preserving cross-site correlation. The sensor models the distinct content flow of a network or host using a semi-supervised training regimen. Previously known exploits, extracted from the signatures of an IDS, are likewise modeled in a Bloom filter and are used during training as well as detection time. We demonstrate that Anagram can identify anomalous traffic with high accuracy and low false positive rates. Anagram's high-order n-gram analysis technique is also resilient against simple mimicry attacks that blend exploits with “normal” appearing byte padding, such as the blended polymorphic attack recently demonstrated in [1]. We discuss randomized n-gram models, which further raise the bar and make it more difficult for attackers to build precise packet structures to evade Anagram even if they know the distribution of the local site content flow. Finally, Anagram's speed and high detection rate make it valuable not only as a standalone sensor, but also as a network anomaly flow classifier in an instrumented fault-tolerant host-based environment; this enables significant cost amortization and the possibility of a “symbiotic” feedback loop that can improve accuracy and reduce false positive rates over time.
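
A minimal sketch of the content-model idea under stated assumptions: higher-order n-grams seen in training are stored in a small Bloom filter, and a new payload is scored by the fraction of its n-grams never seen before. The filter size, hash count, and n are illustrative, not the published sensor's parameters.

```python
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=3):
        self.size, self.num_hashes = size_bits, num_hashes
        self.bits = bytearray(size_bits // 8)
    def _positions(self, item: bytes):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size
    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(payload: bytes, n=5):
    return (payload[i:i + n] for i in range(len(payload) - n + 1))

def train(model: BloomFilter, payloads, n=5):
    for payload in payloads:
        for gram in ngrams(payload, n):
            model.add(gram)

def score(model: BloomFilter, payload: bytes, n=5):
    """Fraction of the payload's n-grams never seen in training; higher = more anomalous."""
    grams = list(ngrams(payload, n))
    return sum(gram not in model for gram in grams) / max(len(grams), 1)
```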

Collaboration


Dive into Salvatore J. Stolfo's collaborations.

Top Co-Authors

Wenke Lee
Georgia Institute of Technology

Philip K. Chan
Florida Institute of Technology

Eleazar Eskin
University of California