Blake Anderson | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Blake Anderson is active.

Explore More

Publication

Featured researches published by Blake Anderson.

Journal of Computer Virology and Hacking Techniques | 2018

Deciphering malware’s use of TLS (without decryption)

Blake Anderson; Subharthi Paul; David A. McGrew

The use of TLS by malware poses new challenges to network threat detection because traditional pattern-matching techniques can no longer be applied to its messages. However, TLS also introduces a complex set of observable data features that allow many inferences to be made about both the client and the server. We show that these features can be used to detect and understand malware communication, while at the same time preserving the privacy of the benign uses of encryption. These data features also allow for accurate malware family attribution of network communication, even when restricted to a single, encrypted flow. To demonstrate this, we performed a detailed study of how TLS is used by malware and enterprise applications. We provide a general analysis on millions of TLS encrypted flows, and a targeted study on 18 malware families composed of thousands of unique malware samples and tens-of-thousands of malicious TLS flows. Importantly, we identify and accommodate for the bias introduced by the use of a malware sandbox. We show that the performance of a malware classifier is correlated with a malware family’s use of TLS, i.e., malware families that actively evolve their use of cryptography are more difficult to classify. We conclude that malware’s usage of TLS is distinct in an enterprise setting, and that these differences can be effectively used in rules and machine learning classifiers.

international conference on network protocols | 2016

Enhanced telemetry for encrypted threat analytics

David A. McGrew; Blake Anderson

Traditional flow monitoring provides a high-level view of network communications by reporting the addresses, ports, and byte and packet counts of a flow. This data is valuable, but it gives little insight into the actual content or context of a flow. To obtain this missing insight, we investigated intra-flow data, that is, information about events that occur inside of a flow that can be conveniently collected, stored, and analyzed within a flow monitoring framework. The focus of our work is on new types of data that are independent of protocol details, such as the lengths and arrival times of messages within a flow. These data elements have the attractive property that they apply equally well to both encrypted and unencrypted flows. Protocol-aware telemetry, specifically TLS-aware telemetry, is also analyzed. In this paper, we explore the benefits of enhanced telemetry, desirable properties of new intra-flow data features with respect to a flow monitoring system, and how best to use machine learning classifiers that operate on this data. We provide results on millions of flows processed by our open source program. Finally, we show that leveraging appropriate data features and simple machine learning models can successfully identify threats in encrypted network traffic.

Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security | 2016

Identifying Encrypted Malware Traffic with Contextual Flow Data

Blake Anderson; David A. McGrew

Identifying threats contained within encrypted network traffic poses a unique set of challenges. It is important to monitor this traffic for threats and malware, but do so in a way that maintains the integrity of the encryption. Because pattern matching cannot operate on encrypted data, previous approaches have leveraged observable metadata gathered from the flow, e.g., the flows packet lengths and inter-arrival times. In this work, we extend the current state-of-the-art by considering a data omnia approach. To this end, we develop supervised machine learning models that take advantage of a unique and diverse set of network flow data features. These data features include TLS handshake metadata, DNS contextual flows linked to the encrypted flow, and the HTTP headers of HTTP contextual flows from the same source IP address within a 5 minute window. We begin by exhibiting the differences between malicious and benign traffics use of TLS, DNS, and HTTP on millions of unique flows. This study is used to design the feature sets that have the most discriminatory power. We then show that incorporating this contextual information into a supervised learning system significantly increases performance at a 0.00% false discovery rate for the problem of classifying encrypted, malicious flows. We further validate our false positive rate on an independent, real-world dataset.

knowledge discovery and data mining | 2017

Machine Learning for Encrypted Malware Traffic Classification: Accounting for Noisy Labels and Non-Stationarity

Blake Anderson; David A. McGrew

The application of machine learning for the detection of malicious network traffic has been well researched over the past several decades; it is particularly appealing when the traffic is encrypted because traditional pattern-matching approaches cannot be used. Unfortunately, the promise of machine learning has been slow to materialize in the network security domain. In this paper, we highlight two primary reasons why this is the case: inaccurate ground truth and a highly non-stationary data distribution. To demonstrate and understand the effect that these pitfalls have on popular machine learning algorithms, we design and carry out experiments that show how six common algorithms perform when confronted with real network data. With our experimental results, we identify the situations in which certain classes of algorithms underperform on the task of encrypted malware traffic classification. We offer concrete recommendations for practitioners given the real-world constraints outlined. From an algorithmic perspective, we find that the random forest ensemble method outperformed competing methods. More importantly, feature engineering was decisive; we found that iterating on the initial feature set, and including features suggested by domain experts, had a much greater impact on the performance of the classification system. For example, linear regression using the more expressive feature set easily outperformed the random forest method using a standard network traffic representation on all criteria considered. Our analysis is based on millions of TLS encrypted sessions collected over 12 months from a commercial malware sandbox and two geographically distinct, large enterprise networks.

Statistical Analysis and Data Mining | 2017

Order priors for Bayesian network discovery with an application to malware phylogeny

Diane Oyen; Blake Anderson; Kari Sentz; Christine M. Anderson-Cook

Bayesian networks have been used extensively to model and discover dependency relationships among sets of random variables. We learn Bayesian network structure with a combination of human knowledge about the partial ordering of variables and statistical inference of conditional dependencies from observed data. Our approach leverages complementary information from human knowledge and inference from observed data to produce networks that reflect human beliefs about the system as well as to fit the observed data. Applying prior beliefs about partial orderings of variables is an approach distinctly different from existing methods that incorporate prior beliefs about direct dependencies (or edges) in a Bayesian network. We provide an efficient implementation of the partial-order prior in a Bayesian structure discovery learning algorithm, as well as an edge prior, showing that both priors meet the local modularity requirement necessary for an efficient Bayesian discovery algorithm. In benchmark studies, the partial-order prior improves the accuracy of Bayesian network structure learning as well as the edge prior, even though order priors are more general. Our primary motivation is in characterizing the evolution of families of malware to aid cyber security analysts. For the problem of malware phylogeny discovery, we find that our algorithm, compared to existing malware phylogeny algorithms, more accurately discovers true dependencies that are missed by other algorithms.

national conference on artificial intelligence | 2016