Is this you? Create Your Porfile

Justin Ma

University of California, San Diego

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Justin Ma is active.

Explore More

Publication

Featured researches published by Justin Ma.

knowledge discovery and data mining | 2009

Beyond blacklists: learning to detect malicious web sites from suspicious URLs

Justin Ma; Lawrence K. Saul; Stefan Savage; Geoffrey M. Voelker

Malicious Web sites are a cornerstone of Internet criminal activities. As a result, there has been broad interest in developing systems to prevent the end user from visiting such sites. In this paper, we describe an approach to this problem based on automated URL classification, using statistical methods to discover the tell-tale lexical and host-based properties of malicious Web site URLs. These methods are able to learn highly predictive models by extracting and automatically analyzing tens of thousands of features potentially indicative of suspicious URLs. The resulting classifiers obtain 95-99% accuracy, detecting large numbers of malicious Web sites from their URLs, with only modest false positives.

international conference on machine learning | 2009

Identifying suspicious URLs: an application of large-scale online learning

Justin Ma; Lawrence K. Saul; Stefan Savage; Geoffrey M. Voelker

This paper explores online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms as the size of the training data is larger than can be efficiently processed in batch and because the distribution of features that typify malicious URLs is changing continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently-developed online algorithms can be as accurate as batch techniques, achieving classification accuracies up to 99% over a balanced data set.

acm special interest group on data communication | 2011

Managing data transfers in computer clusters with orchestra

Mosharaf Chowdhury; Matei Zaharia; Justin Ma; Michael I. Jordan; Ion Stoica

Cluster computing applications like MapReduce and Dryad transfer massive amounts of data between their computation stages. These transfers can have a significant impact on job performance, accounting for more than 50% of job completion times. Despite this impact, there has been relatively little work on optimizing the performance of these data transfers, with networking researchers traditionally focusing on per-flow traffic management. We address this limitation by proposing a global management architecture and a set of algorithms that (1) improve the transfer times of common communication patterns, such as broadcast and shuffle, and (2) allow scheduling policies at the transfer level, such as prioritizing a transfer over other transfers. Using a prototype implementation, we show that our solution improves broadcast completion times by up to 4.5X compared to the status quo in Hadoop. We also show that transfer-level scheduling can reduce the completion time of high-priority transfers by 1.7X.

symposium on operating systems principles | 2005

Scalability, fidelity, and containment in the potemkin virtual honeyfarm

Michael Vrable; Justin Ma; Jay Chen; David Moore; Erik Vandekieft; Alex C. Snoeren; Geoffrey M. Voelker; Stefan Savage

The rapid evolution of large-scale worms, viruses and bot-nets have made Internet malware a pressing concern. Such infections are at the root of modern scourges including DDoS extortion, on-line identity theft, SPAM, phishing, and piracy. However, the most widely used tools for gathering intelligence on new malware -- network honeypots -- have forced investigators to choose between monitoring activity at a large scale or capturing behavior with high fidelity. In this paper, we describe an approach to minimize this tension and improve honeypot scalability by up to six orders of magnitude while still closely emulating the execution behavior of individual Internet hosts. We have built a prototype honeyfarm system, called Potemkin, that exploits virtual machines, aggressive memory sharing, and late binding of resources to achieve this goal. While still an immature implementation, Potemkin has emulated over 64,000 Internet honeypots in live test runs, using only a handful of physical servers.

internet measurement conference | 2006

Unexpected means of protocol inference

Justin Ma; Kirill Levchenko; Christian Kreibich; Stefan Savage; Geoffrey M. Voelker

Network managers are inevitably called upon to associate network traffic with particular applications. Indeed, this operation is critical for a wide range of management functions ranging from debugging and security to analytics and policy support. Traditionally, managers have relied on application adherence to a well established global port mapping: Web traffic on port 80, mail traffic on port 25 and so on. However, a range of factors - including firewall port blocking, tunneling, dynamic port allocation, and a bloom of new distributed applications - has weakened the value of this approach. We analyze three alternative mechanisms using statistical and structural content models for automatically identifying traffic that uses the same application-layer protocol, relying solely on flow content. In this manner, known applications may be identified regardless of port number, while traffic from one unknown application will be identified as distinct from another. We evaluate each mechanisms classification performance using real-world traffic traces from multiple sites.

ACM Transactions on Intelligent Systems and Technology | 2011

Learning to detect malicious URLs

Justin Ma; Lawrence K. Saul; Stefan Savage; Geoffrey M. Voelker

Malicious Web sites are a cornerstone of Internet criminal activities. The dangers of these sites have created a demand for safeguards that protect end-users from visiting them. This article explores how to detect malicious Web sites from the lexical and host-based features of their URLs. We show that this problem lends itself naturally to modern algorithms for online learning. Online algorithms not only process large numbers of URLs more efficiently than batch algorithms, they also adapt more quickly to new features in the continuously evolving distribution of malicious URLs. We develop a real-time system for gathering URL features and pair it with a real-time feed of labeled URLs from a large Web mail provider. From these features and labels, we are able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.

Proceedings of the IEEE | 2006

Wireless Network Security and Interworking

Minho Shin; Justin Ma; Arunesh Mishra; William A. Arbaugh

A variety of wireless technologies have been standardized and commercialized, but no single technology is considered the best because of different coverage and bandwidth limitations. Thus, interworking between heterogeneous wireless networks is extremely important for ubiquitous and high-performance wireless communications. Security in interworking is a major challenge due to the vastly different security architectures used within each network. The goal of this paper is twofold. First, we provide a comprehensive discussion of security problems and current technologies in 3G and WLAN systems. Second, we provide introductory discussions about the security problems in interworking, the state-of-the-art solutions, and open problems.

symposium on cloud computing | 2011

Scaling the mobile millennium system in the cloud

Timothy Hunter; Teodor Mihai Moldovan; Matei Zaharia; Samy Merzgui; Justin Ma; Michael J. Franklin; Pieter Abbeel; Alexandre M. Bayen

We report on our experience scaling up the Mobile Millennium traffic information system using cloud computing and the Spark cluster computing framework. Mobile Millennium uses machine learning to infer traffic conditions for large metropolitan areas from crowdsourced data, and Spark was specifically designed to support such applications. Many studies of cloud computing frameworks have demonstrated scalability and performance improvements for simple machine learning algorithms. Our experience implementing a real-world machine learning-based application corroborates such benefits, but we also encountered several challenges that have not been widely reported. These include: managing large parameter vectors, using memory efficiently, and integrating with the applications existing storage infrastructure. This paper describes these challenges and the changes they required in both the Spark framework and the Mobile Millennium software. While we focus on a system for traffic estimation, we believe that the lessons learned are applicable to other machine learning-based applications.

internet measurement conference | 2006

Finding diversity in remote code injection exploits

Justin Ma; John Dunagan; Helen J. Wang; Stefan Savage; Geoffrey M. Voelker

Remote code injection exploits inflict a significant societal cost, and an active underground economy has grown up around these continually evolving attacks. We present a methodology for inferring the phylogeny, or evolutionary tree, of such exploits. We have applied this methodology to traffic captured at several vantage points, and we demonstrate that our methodology is robust to the observed polymorphism. Our techniques revealed non-trivial code sharing among different exploit families, and the resulting phylogenies accurately captured the subtle variations among exploits within each family. Thus, we believe our methodology and results are a helpful step to better understanding the evolution of remote code injection exploits on the Internet.

Communications of The ACM | 2011