Jack W. Stokes | Researchain

Publication

Featured research published by Jack W. Stokes.

international conference on acoustics, speech, and signal processing | 2013

Large-scale malware classification using random projections and neural networks

George E. Dahl; Jack W. Stokes; Li Deng; Dong Yu

Automatically generated malware is a significant problem for computer users. Analysts are able to manually investigate a small number of unknown files, but the best large-scale defense for detecting malware is automated malware classification. Malware classifiers often use sparse binary features, and the number of potential features can be on the order of tens or hundreds of millions. Feature selection reduces the number of features to a manageable number for training simpler algorithms such as logistic regression, but this number is still too large for more complex algorithms such as neural networks. To overcome this problem, we used random projections to further reduce the dimensionality of the original input space. Using this architecture, we train several very large-scale neural network systems with over 2.6 million labeled samples, thereby achieving classification results with a two-class error rate of 0.49% for a single neural network and 0.42% for an ensemble of neural networks.
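
The dimensionality-reduction step described above can be illustrated with a minimal sketch. The abstract does not say how the projection matrix is built, so the sparse {-1, 0, +1} matrix, the function names `make_projection_matrix` and `project`, and the unscaled output are all assumptions made for illustration only.

```python
import random


def make_projection_matrix(n_features, n_components, seed=0):
    """Build a sparse random matrix with entries -1, 0, +1 (assumed scheme).

    Each entry is +1 or -1 with probability 1/6 each and 0 otherwise.
    """
    rng = random.Random(seed)
    return [
        [rng.choice((-1, 0, 0, 0, 0, 1)) for _ in range(n_components)]
        for _ in range(n_features)
    ]


def project(active_indices, matrix):
    """Project a sparse binary vector, given as its set of active indices.

    Because the input is binary, the projection is just the sum of the
    matrix rows that belong to the active features (no scaling applied).
    """
    n_components = len(matrix[0])
    out = [0] * n_components
    for i in active_indices:
        row = matrix[i]
        for j in range(n_components):
            out[j] += row[j]
    return out
```

For example, `project({3, 17, 42}, make_projection_matrix(100, 8))` maps a 100-dimensional binary vector with three active features to 8 dense values, which could then feed a neural network.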

international world wide web conferences | 2011

ARROW: GenerAting SignatuRes to Detect DRive-By DOWnloads

Junjie Zhang; Christian Seifert; Jack W. Stokes; Wenke Lee

A drive-by download attack occurs when a user visits a webpage which attempts to automatically download malware without the user's consent. Attackers sometimes use a malware distribution network (MDN) to manage a large number of malicious webpages, exploits, and malware executables. In this paper, we provide a new method to determine these MDNs from the secondary URLs and redirect chains recorded by a high-interaction client honeypot. In addition, we propose a novel drive-by download detection method. Instead of depending on the malicious content used by previous methods, our algorithm first identifies and then leverages the URLs of the MDNs' central servers, where a central server is a common server shared by a large percentage of the drive-by download attacks in the same MDN. A set of regular expression-based signatures is then generated based on the URLs of each central server. This method allows additional malicious webpages to be identified which launched but failed to execute a successful drive-by download attack. The new drive-by detection system, named ARROW, has been implemented, and we provide a large-scale evaluation on the output of a production drive-by detection system. The experimental results demonstrate the effectiveness of our method, where the detection coverage has been boosted by 96% with an extremely low false positive rate.
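
To make the idea of a regular expression-based URL signature concrete, here is a rough sketch. It is not the ARROW algorithm, whose details the abstract does not give: the tokenization rule, the function name `generalize`, and the example URLs are hypothetical.

```python
import re


def generalize(urls):
    """Derive one regex from several URLs of the same (assumed) central server.

    URLs are split on common delimiters. Token positions that agree across
    all URLs are kept as literals; positions that vary become a wildcard.
    """
    token_lists = [re.split(r"([/?&=.])", u) for u in urls]
    if len({len(tokens) for tokens in token_lists}) != 1:
        raise ValueError("sketch only handles URLs with the same token count")
    parts = []
    for column in zip(*token_lists):
        if len(set(column)) == 1:
            parts.append(re.escape(column[0]))   # shared literal token
        elif all(tok.isdigit() for tok in column):
            parts.append(r"\d+")                 # varying numeric token
        else:
            parts.append(r"[^/?&=.]+")           # varying generic token
    return "".join(parts)


# Hypothetical URLs observed for one central server.
signature = generalize([
    "http://evil.example/in.php?id=123",
    "http://evil.example/in.php?id=4567",
])
```

A candidate URL would then be checked with `re.fullmatch(signature, url)`. A production system would also need to handle URLs of differing structure and to control false positives, which this sketch ignores.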

international conference on acoustics, speech, and signal processing | 2015

Malware classification with recurrent networks

Razvan Pascanu; Jack W. Stokes; Hermineh Sanossian; Mady Marinescu; Anil Francis Thomas

Attackers often create systems that automatically rewrite and reorder their malware to avoid detection. Typical machine learning approaches, which learn a classifier based on a handcrafted feature vector, are not sufficiently robust to such reorderings. We propose a different approach, which, similar to natural language modeling, learns the language of malware spoken through the executed instructions and extracts robust, time domain features. Echo state networks (ESNs) and recurrent neural networks (RNNs) are used for the projection stage that extracts the features. These models are trained in an unsupervised fashion. A standard classifier uses these features to detect malicious files. We explore a few variants of ESNs and RNNs for the projection stage, including Max-Pooling and Half-Frame models which we propose. The best performing hybrid model uses an ESN for the recurrent model, Max-Pooling for non-linear sampling, and logistic regression for the final classification. Compared to the standard trigram of events model, it improves the true positive rate by 98.3% at a false positive rate of 0.1%.
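
The projection stage with max-pooling can be sketched as follows. This is only an illustration under assumptions: events are taken to be integer IDs, the reservoir is fixed and random with no training at all, and the names `make_reservoir` and `max_pooled_features` are invented here. The resulting feature vector would be passed to a separate classifier such as logistic regression.

```python
import math
import random


def make_reservoir(n_events, n_hidden, seed=0, scale=0.5):
    """Create fixed random input and recurrent weights for a tiny reservoir.

    Recurrent weights are bounded so each row's absolute sum is at most
    `scale`, which keeps the spectral radius below 1 when scale < 1.
    """
    rng = random.Random(seed)
    w_in = [[rng.uniform(-1, 1) for _ in range(n_events)] for _ in range(n_hidden)]
    w_rec = [
        [rng.uniform(-1, 1) * scale / n_hidden for _ in range(n_hidden)]
        for _ in range(n_hidden)
    ]
    return w_in, w_rec


def max_pooled_features(events, w_in, w_rec):
    """Run a non-empty event sequence through the reservoir and max-pool over time."""
    n_hidden = len(w_rec)
    h = [0.0] * n_hidden
    pooled = [float("-inf")] * n_hidden
    for e in events:
        # One-hot input: only column e of w_in contributes.
        h = [
            math.tanh(w_in[i][e] + sum(w_rec[i][j] * h[j] for j in range(n_hidden)))
            for i in range(n_hidden)
        ]
        pooled = [max(p, x) for p, x in zip(pooled, h)]
    return pooled
```

Max-pooling keeps, for each hidden unit, its largest activation over the whole sequence, so the feature vector has a fixed length regardless of how many events a file produces.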

adversarial information retrieval on the web | 2008

A large-scale study of automated web search traffic

Gregory Buehrer; Jack W. Stokes; Kumar Chellapilla

As web search providers seek to improve both relevance and response times, they are challenged by the ever-increasing tax of automated search query traffic. Third-party systems interact with search engines for a variety of reasons, such as monitoring a website's rank, augmenting online games, or possibly maliciously altering click-through rates. In this paper, we investigate automated traffic in the query stream of a large search engine provider. We define automated traffic as any search query not generated by a human in real time. We first provide examples of different categories of query logs generated by bots. We then develop many different features that distinguish between queries generated by people searching for information and those generated by automated processes. We categorize these features into two classes: interpretations of the physical model of human interactions, and behavioral patterns of automated interactions. We believe these features form a basis for a production-level query stream classifier.

international conference on acoustics, speech, and signal processing | 2008

Nonlinear residual acoustic echo suppression for high levels of harmonic distortion

Diego Ariel Bendersky; Jack W. Stokes; Henrique S. Malvar

Linear adaptive filters are often used for acoustic echo cancellation (AEC) but sometimes fail to perform well in notebook computers and inexpensive telephony devices. Low-quality speakers and poorly designed enclosures that produce vibrations often generate harmonic distortion, and this nonlinear effect degrades the performance of linear AEC algorithms considerably. In this work, we present a new AEC architecture that consists of a linear, subband adaptive AEC filter followed by a nonlinear residual echo suppression (RES) stage specifically designed to address harmonic distortion. In addition to suppressing the residual echo in the primary subband, the proposed model also suppresses the residual echo in a window of bands surrounding the higher order harmonics. Results show considerable improvement over other proposed algorithms, and the new algorithm has much lower implementation costs compared to nonlinear AEC models based on Volterra filters and a previously proposed, nonlinear residual echo suppression algorithm.

international conference on acoustics, speech, and signal processing | 2004

Acoustic echo cancellation with arbitrary playback sampling rate

Jack W. Stokes; Henrique S. Malvar

This paper introduces a new architecture for implementing subband acoustic echo cancellation (AEC) with arbitrary playback sampling rate. Typically, in AEC algorithms for audio or videoconferencing, the sampling rates for the signals played through the speakers and captured from the microphones are identical. For speech recognition while playing CD-quality music and Internet gaming with voice chat, the playback sampling rate is usually higher than the capture rate. A direct solution is to apply a sampling rate converter to the playback signal before feeding it to the AEC, but that is complicated if many sampling frequencies must be supported. We propose a more efficient solution for subband AEC: we perform the sampling rate conversion as a frequency-domain interpolation that matches the transform lengths of the playback and capture signals. Results show that the new AEC architecture has a small computational cost and only a minimal reduction in echo attenuation.

dependable systems and networks | 2013

Detecting malicious landing pages in Malware Distribution Networks

Gang Wang; Jack W. Stokes; Cormac Herley; David Felstead

Drive-by download attacks attempt to compromise a victim's computer through browser vulnerabilities. Often they are launched from Malware Distribution Networks (MDNs) consisting of landing pages to attract traffic, intermediate redirection servers, and exploit servers which attempt the compromise. In this paper, we present a novel approach to discovering the landing pages that lead to drive-by downloads. Starting from partial knowledge of a given collection of MDNs, we identify the malicious content on their landing pages using multiclass feature selection. We then query the webpage cache of a commercial search engine to identify landing pages containing the same or similar content. In this way we are able to identify previously unknown landing pages belonging to already identified MDNs, which allows us to expand our understanding of the MDN. We explore both a rule-based and a classifier-based approach to identifying potentially malicious landing pages. We build both systems and independently verify using a high-interaction honeypot that the newly identified landing pages indeed attempt drive-by downloads. For the rule-based system, 57% of the landing pages predicted as malicious are confirmed, and this success rate remains constant in two large trials spaced five months apart. This extends the known footprint of the MDNs studied by 17%. The classifier-based system is less successful, and we explore possible reasons.

international conference on detection of intrusions and malware and vulnerability assessment | 2012

Using file relationships in malware classification

Nikos Karampatziakis; Jack W. Stokes; Anil Francis Thomas; Mady Marinescu

Typical malware classification methods analyze unknown files in isolation. However, this ignores valuable relationships between malware files, such as containment in a zip archive, dropping, or downloading. We present a new malware classification system based on a graph induced by file relationships, and, as a proof of concept, analyze containment relationships, for which we have much available data. However, our methodology is general, relying only on an initial estimate for some of the files in our data and on propagating information along the edges of the graph. It can thus be applied to other types of file relationships. We show that since malicious files are often included in multiple malware containers, the system's detection accuracy can be significantly improved, particularly at low false positive rates, which are the main operating points for automated malware classifiers. For example, at a false positive rate of 0.2%, the false negative rate decreases from 42.1% to 15.2%. Finally, the new system is highly scalable; our basic implementation can learn good classifiers from a large, bipartite graph including over 719 thousand containers and 3.4 million files in a total of 16 minutes.
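
Propagating an initial estimate along the edges of a container/file graph might look like the sketch below. The abstract does not specify the update rule, so the simple averaging scheme, the blending weight `alpha`, the neutral default of 0.5, and the function name `propagate` are all assumptions rather than the paper's method.

```python
def propagate(containers, prior, alpha=0.5, n_iters=10, default=0.5):
    """Spread malware scores over a bipartite container/file graph.

    containers: dict mapping container id -> list of contained file ids.
    prior:      dict mapping file id -> initial malware probability
                (files without an estimate start at `default`).
    Returns a dict mapping file id -> propagated score in [0, 1].
    """
    files = {f for members in containers.values() for f in members}
    base = {f: prior.get(f, default) for f in files}
    parents = {f: [] for f in files}
    for c, members in containers.items():
        for f in members:
            parents[f].append(c)

    score = dict(base)
    for _ in range(n_iters):
        # A container's score is the mean score of the files it holds.
        c_score = {
            c: sum(score[f] for f in members) / len(members)
            for c, members in containers.items()
            if members
        }
        # A file blends its own prior with the mean score of its containers.
        score = {
            f: (1 - alpha) * base[f]
            + alpha * sum(c_score[c] for c in parents[f]) / len(parents[f])
            for f in files
        }
    return score
```

With `containers = {"a.zip": ["known_bad", "unknown"]}` and `prior = {"known_bad": 1.0}`, the file `unknown` ends up with a score above the neutral 0.5 because it shares a container with a file already believed to be malicious.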

international conference on multimedia and expo | 2007

Normalized Double-Talk Detection Based on Microphone and AEC Error Cross-Correlation

M.A. Iqbal; Jack W. Stokes; Steven L. Grant

In this paper, we present two different double-talk detection schemes for Acoustic Echo Cancellation (AEC). First, we present a novel normalized detection statistic based on the cross-correlation coefficient between the microphone signal and the cancellation error. The decision statistic is designed in such a way that it meets the needs of an optimal double-talk detector. We also show that the proposed detection statistic converges to the recently proposed normalized cross-correlation based double-talk detector, the best known cross-correlation based detector. Next, we present a new hybrid double-talk detection scheme based on a cross-correlation coefficient and two signal detectors. The hybrid algorithm not only detects double-talk but also detects and tracks any echo-path variations efficiently. We compare our results with other cross-correlation based double-talk detectors to show their effectiveness.
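
As a rough illustration of a normalized statistic built from the microphone signal and the cancellation error, consider the sketch below. The abstract does not give the formula, so the form 1 - r_em / sigma_m^2, the block-wise estimation, the threshold value, and the function names are assumptions and may differ from the detector in the paper.

```python
def double_talk_statistic(mic, err):
    """Estimate 1 - r_em / sigma_m^2 over one block of samples (assumed form).

    mic: microphone samples (echo plus any near-end speech).
    err: AEC error samples (microphone minus the estimated echo).
    With a converged canceller and no near-end speech the error is small,
    so the statistic stays near 1; near-end speech pulls it below 1.
    """
    r_em = sum(e * m for e, m in zip(err, mic))
    power_m = sum(m * m for m in mic)
    if power_m == 0.0:
        return 1.0  # silent block: nothing to flag
    return 1.0 - r_em / power_m


def is_double_talk(mic, err, threshold=0.9):
    """Flag double-talk when the statistic drops below a chosen threshold."""
    return double_talk_statistic(mic, err) < threshold
```

In practice the correlation and power would be tracked with recursive averages rather than recomputed per block, and the threshold would be tuned on recorded data.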

international conference on detection of intrusions and malware and vulnerability assessment | 2016

MtNet: A Multi-Task Neural Network for Dynamic Malware Classification

Wenyi Huang; Jack W. Stokes

In this paper, we propose a new multi-task, deep learning architecture for the binary (i.e., malware versus benign) malware classification task. All models are trained with data extracted from dynamic analysis of malicious and benign files. For the first time, we see improvements using multiple layers in a deep neural network architecture for malware classification. The system is trained on 4.5 million files and tested on a holdout test set of 2 million files, which is the largest study to date. To achieve a binary classification error rate of 0.358%, the objective functions for the binary classification task and malware family classification task are combined in the multi-task architecture. In addition, we propose a standard (i.e., non-multi-task) malware family classification architecture which also achieves a malware family classification error rate of 2.94%.
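
Combining the two objective functions can be sketched as a weighted sum of a binary cross-entropy term and a family cross-entropy term. The abstract does not state how the objectives are weighted, so the additive form, the `weight` parameter, and the function names below are assumptions.

```python
import math


def binary_cross_entropy(p_malware, is_malware):
    """Loss for the malware-versus-benign head, given P(malware) in (0, 1)."""
    return -math.log(p_malware if is_malware else 1.0 - p_malware)


def softmax_cross_entropy(logits, target):
    """Loss for the family head, computed stably from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multitask_loss(p_malware, is_malware, family_logits, family, weight=1.0):
    """Sum of both task losses for one file (assumed weighting scheme)."""
    return binary_cross_entropy(p_malware, is_malware) + weight * softmax_cross_entropy(
        family_logits, family
    )
```

Both heads would share the lower layers of the network, so gradients from the family task also shape the representation used for the binary decision.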