Scott Charles Evans | Researchain

Publication

Featured research published by Scott Charles Evans.

international conference on data mining | 2011

Time Series Epenthesis: Clustering Time Series Streams Requires Ignoring Some Data

Thanawin Rakthanmanon; Eamonn J. Keogh; Stefano Lonardi; Scott Charles Evans

Given the pervasiveness of time series data in all human endeavors, and the ubiquity of clustering as a data mining application, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series, e.g., gene expression profiles, individual heartbeats or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases shown to work only on the most contrived datasets by carefully adjusting a large set of parameters. In this work, we make two fundamental contributions. First, we show that the problem definition for time series clustering from streams currently used is inherently flawed, and a new definition is necessary. Second, we show that the Minimum Description Length (MDL) framework offers an efficient, effective and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, zoology and industrial process analyses.

international conference on data mining | 2011

Discovering the Intrinsic Cardinality and Dimensionality of Time Series Using MDL

Bing Hu; Thanawin Rakthanmanon; Yuan Hao; Scott Charles Evans; Stefano Lonardi; Eamonn J. Keogh

Most algorithms for mining or indexing time series data do not operate directly on the original data; instead, they consider alternative representations that include transforms, quantization, approximation, and multi-resolution abstractions. Choosing the best representation and abstraction level for a given task/dataset is arguably the most critical step in time series data mining. In this paper, we investigate techniques to discover the natural intrinsic representation model, dimensionality and alphabet cardinality of a time series. The ability to discover these intrinsic features has implications beyond selecting the best parameters for particular algorithms, as characterizing data in such a manner is useful in its own right and an important sub-routine in algorithms for classification, clustering and outlier discovery. We frame the discovery of these intrinsic features in the Minimum Description Length (MDL) framework. Extensive empirical tests show that our method is simpler, more general and significantly more accurate than previous methods, and has the important advantage of being essentially parameter-free.

Eurasip Journal on Bioinformatics and Systems Biology | 2007

MicroRNA target detection and analysis for genes related to breast cancer using MDLcompress

Scott Charles Evans; Antonis Kourtidis; T. Stephen Markham; Jonathan Miller; Douglas S. Conklin; Andrew Soliz Torres

We describe initial results of miRNA sequence analysis with the optimal symbol compression ratio (OSCR) algorithm and recast this grammar inference algorithm as an improved minimum description length (MDL) learning tool: MDLcompress. We apply this tool to explore the relationship between miRNAs, single nucleotide polymorphisms (SNPs), and breast cancer. Our new algorithm outperforms other grammar-based coding methods, such as DNA Sequitur, while retaining a two-part code that highlights biologically significant phrases. The deep recursion of MDLcompress, together with its explicit two-part coding, enables it to identify biologically meaningful sequence without needlessly restrictive priors. The ability to quantify cost in bits for phrases in the MDL model allows prediction of regions where SNPs may have the most impact on biological activity. MDLcompress improves on our previous algorithm in execution time through an innovative data structure, and in specificity of motif detection (compression) through improved heuristics. An MDLcompress analysis of 144 overexpressed genes from the breast cancer cell line BT474 has identified novel motifs, including potential microRNA (miRNA) binding sites that are candidates for experimental validation.

Knowledge and Information Systems | 2012

MDL-based time series clustering

Thanawin Rakthanmanon; Eamonn J. Keogh; Stefano Lonardi; Scott Charles Evans

Time series data are pervasive across all human endeavors, and clustering is arguably the most fundamental data mining application. Given this, it is somewhat surprising that the problem of time series clustering from a single stream remains largely unsolved. Most work on time series clustering considers the clustering of individual time series that have been carefully extracted from their original context, for example, gene expression profiles, individual heartbeats, or individual gait cycles. The few attempts at clustering time series streams have been shown to be objectively incorrect in some cases, and in other cases shown to work only on the most contrived synthetic datasets by carefully adjusting a large set of parameters. In this work, we make two fundamental contributions that allow, for the first time, the meaningful clustering of subsequences from a time series stream. First, we show that the problem definition for time series clustering from streams currently used is inherently flawed, and a new definition is necessary. Second, we show that the minimum description length framework offers an efficient, effective, and essentially parameter-free method for time series clustering. We show that our method produces objectively correct results on a wide variety of datasets from medicine, speech recognition, zoology, gesture recognition, and industrial process analyses.

darpa information survivability conference and exposition | 2001

Information assurance through Kolmogorov complexity

Scott Charles Evans; Stephen F. Bush; John Erik Hershey

The problem of information assurance is approached from the point of view of Kolmogorov complexity and minimum message length criteria. Several theoretical results are obtained, possible applications are discussed, and a new metric for measuring complexity is introduced. The use of Kolmogorov-complexity-like metrics as conserved parameters to detect abnormal system behavior is explored. Data and process vulnerabilities are put forward as two different dimensions of vulnerability that can be discussed in terms of Kolmogorov complexity. Finally, these results are used to conduct complexity-based vulnerability analysis.

military communications conference | 2007

MDLcompress for Intrusion Detection: Signature Inference and Masquerade Attack

Scott Charles Evans; Earl Eiland; Stephen Markham; Jeremy Impson; Adam Laczo

MDLcompress is a grammar inference algorithm that uses Minimum Description Length principles from the theory of Kolmogorov Complexity and Algorithmic Information Theory to infer a grammar, finding the patterns and motifs that aid most in compressing unknown data sets. This technology has been applied to detection of FTP exploits and inference of DNA sequence motifs related to breast cancer. In this paper we apply MDLcompress to infer grammars, and then apply those grammars to identify masquerades in the publicly available Schonlau system call data sets. Compared to similar protocols, our system detects anomalous events with comparable performance, with the added advantage of executing in linear time.

asilomar conference on signals, systems and computers | 2006

An Improved Minimum Description Length Learning Algorithm for Nucleotide Sequence Analysis

Scott Charles Evans; Steve Markham; Andrew Soliz Torres; Antonis Kourtidis; Douglas S. Conklin

We present an improved minimum description length (MDL) learning algorithm, MDLcompress, for nucleotide sequence analysis that outperforms the compression of other grammar-based coding methods such as DNA Sequitur while retaining a two-part code that highlights biologically significant phrases. The phrases recursively added to the MDLcompress model are not necessarily the longest matches, or the most often repeated phrases of a certain length, but those whose combination of length and repetition maximizes compression when the phrase is included in the model. The deep recursion of MDLcompress, combined with its two-part coding nature, makes it uniquely able to identify biologically meaningful sequence without limiting assumptions. The ability to quantify cost in bits for phrases in the MDL model promotes prediction of fragile regions where single nucleotide polymorphisms (SNPs) may have the most impact on biological activity. MDLcompress improves on our previous algorithm in runtime performance through use of an innovative data structure and in specificity of motif detection (compression) through use of improved heuristics. We also discuss recent results from MDLcompress analysis of 144 known overexpressed genes from a breast cancer cell line, BT474. Novel motifs, including potential microRNA (miRNA) binding sites, have been identified within certain genes and are being considered for in vitro validation studies.

military communications conference | 2009

Network attack visualization and response through intelligent icons

Scott Charles Evans; T. Stephen Markham; Richard Bejtlich; Bruce Gordon Barnett; Bernhard Joseph Scholz; Robert James Mitchell; Weizhong Yan; Eric Steinbrecher; Jeremy Impson

The appropriate response to an information system attack is jointly determined by confidence of classification, nature (type) of attack, and confidence in the effectiveness of the response. In this paper we present a technique to rapidly assess the similarity of observed behavior to attack or normal models by displaying the similarity of observed data to learned Minimum Description Length models for normal and attack behaviors using “intelligent icons”. These icons provide a visual indication of similarity to normal and attack signatures and can alert human operators to the key motifs and signatures that affect confidence in classification and the indicated response.

quality of service in heterogeneous wired wireless networks | 2005

Route based QoS and the biased early drop algorithm (BED)

Scott Charles Evans; Marc Robert Pearlman; Michael James Hartman; Asavari Rothe; Martin W. Egan; Manny Leiva

DiffServ offers an attractive solution to quality of service (QoS) for mobile ad-hoc networks (MANETs) since the overhead of alternative flow-based QoS metrics and signaling is not required. However, within prioritized classes and in the presence of dynamically forming bottlenecks, DiffServ can lead to brittle failure modes in which all flows are highly penalized and subsequently fail rather than maintaining some flows at required QoS levels for latency or packet loss. This problem is particularly difficult for UDP traffic, which does not respond to random early detection (RED)-like throttling mechanisms in the presence of congestion. This paper proposes an augmentation of DiffServ QoS over MANET that utilizes metrics available through routing protocols to prevent and resolve congestion. This is done in a manner that promotes maintenance of some high-priority UDP (for example, voice over IP) flows in the presence of bottlenecks. The biased early drop (BED) algorithm is introduced to maintain high continuity of UDP flows in the presence of congestion within the same class of service.

ieee aerospace conference | 2014

Towards wind farm performance optimization through empirical models

Scott Charles Evans; Zhanpan Zhang; Satish Iyengar; Jianhui Chen; John Hilton; Peter Gregg; David Eldridge; Mark Jonkhof; Colin Craig McCulloch; Mohammad Shokoohi-Yekta

Wind turbine performance improvement measurements are challenging, especially when improvements affect air flow to the nacelle anemometer sensor, which is often used to baseline performance. Uncertainty in this area can impede optimization of wind farms by making it difficult to show the benefit of upgrades to individual turbines, jointly optimize wind turbine performance in a farm, and validate the effects of optimization algorithms, particularly farm-level algorithms and strategies that mitigate wake effects. In this paper we introduce methods that augment traditional methods for baselining wind turbine performance using multi-feature estimation based on empirical data, and present a method for normalizing annual energy production (AEP) uncertainty estimates. This innovative method does not rely solely on nacelle anemometer estimates or expensive additional sensors, as has been the historical approach, but can leverage these trusted sensors if they are available. Future directions for whole-farm optimization are discussed.