Sabyasachi Saha | Researchain

Publication

Featured researches published by Sabyasachi Saha.

international conference on computer communications | 2013

Combining supervised and unsupervised learning for zero-day malware detection

Prakash Mandayam Comar; Lei Liu; Sabyasachi Saha; Pang Ning Tan; Antonio Nucci

Malware is one of the most damaging security threats facing the Internet today. Despite the burgeoning literature, accurate detection of malware remains an elusive and challenging endeavor due to the increasing usage of payload encryption and sophisticated obfuscation methods. Also, the large variety of malware classes, coupled with their rapid proliferation, their polymorphic capabilities, and the imperfections of real-world data (noise, missing values, etc.), continues to hinder the use of more sophisticated detection algorithms. This paper presents a novel machine-learning-based framework to detect known and newly emerging malware at high precision using layer 3 and layer 4 network traffic features. The framework combines the accuracy of supervised classification in detecting known classes with the adaptability of unsupervised learning in detecting new classes. It also introduces a tree-based feature transformation to overcome issues due to imperfections of the data and to construct more informative features for the malware detection task. We demonstrate the effectiveness of the framework using real network data from a large Internet service provider.

international conference on computer communications | 2014

Detecting Malicious HTTP Redirections Using Trees of User Browsing Activity

Hesham Mekky; Ruben Torres; Zhi Li Zhang; Sabyasachi Saha; Antonio Nucci

The web has become a platform that attackers exploit to infect vulnerable hosts or to deceive victims into buying rogue software. To accomplish this, attackers either inject malicious scripts into popular web sites or manipulate content delivered by servers to exploit vulnerabilities in users' browsers. To hide malware distribution servers, attackers employ HTTP redirections, which automatically redirect users' requests through a series of intermediate web sites before landing on the final distribution site. In this paper, we develop a methodology to identify malicious chains of HTTP redirections. We build per-user chains from passively collected traffic and extract novel statistical features from them, which capture inherent characteristics of malicious redirection cases. Then, we apply a supervised decision tree classifier to identify malicious chains. Using a large ISP dataset, with more than 15K clients, we demonstrate that our methodology is very effective in accurately identifying malicious chains, with recall and precision values over 90% and up to 98%.
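The chain-building step described in this abstract can be illustrated with a minimal sketch. It assumes each redirect has already been reduced to a (source URL, target URL) pair for a single user; the function names `build_chains` and `chain_features`, and the two features computed, are hypothetical illustrations and are not taken from the paper.

```python
from urllib.parse import urlparse


def build_chains(redirects):
    """Link (source_url, target_url) hops of ONE user into redirection chains.

    Assumption: each source URL redirects to at most one target.
    """
    next_hop = dict(redirects)          # source -> target
    targets = set(next_hop.values())
    chains = []
    for start in next_hop:
        if start in targets:            # only start from the head of a chain
            continue
        chain, seen = [start], {start}
        while chain[-1] in next_hop:
            nxt = next_hop[chain[-1]]
            if nxt in seen:             # guard against redirect loops
                break
            chain.append(nxt)
            seen.add(nxt)
        chains.append(chain)
    return chains


def chain_features(chain):
    """Two illustrative per-chain features (hypothetical, not the paper's)."""
    domains = [urlparse(url).hostname for url in chain]
    return {"hops": len(chain) - 1, "distinct_domains": len(set(domains))}


hops = [
    ("http://a.example/x", "http://b.example/y"),
    ("http://b.example/y", "http://c.example/z"),
]
chains = build_chains(hops)
features = chain_features(chains[0])
```

Feature dictionaries of this kind could then be fed to any supervised classifier, such as a decision tree; the choice of features above is only a placeholder.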

international conference on distributed computing systems | 2015

Systematic Mining of Associated Server Herds for Malware Campaign Discovery

Jialong Zhang; Sabyasachi Saha; Guofei Gu; Sung-Ju Lee; Marco Mellia

HTTP is a popular channel for malware to communicate with malicious servers (e.g., Command & Control, drive-by download, drop-zone), as well as to attack benign servers. By utilizing HTTP requests, malware easily disguises itself under a large amount of benign HTTP traffic. Thus, identifying malicious HTTP activities is challenging. We leverage the insight that cyber criminals are increasingly using dynamic malicious infrastructures with multiple servers to be efficient and anonymous in (i) malware distribution (using redirectors and exploit servers), (ii) control (using C&C servers), and (iii) monetization (using payment servers), while (iv) remaining robust against server takedowns (using multiple backups for each type of server). Instead of focusing on detecting individual malicious domains, we propose a complementary approach to identify a group of closely related servers that are potentially involved in the same malware campaign, which we term an Associated Server Herd (ASH). Our solution, SMASH (Systematic Mining of Associated Server Herds), utilizes an unsupervised framework to infer malware ASHs by systematically mining the relations among all servers from multiple dimensions. We build a prototype system of SMASH and evaluate it with traces from a large ISP. The results show that SMASH successfully infers a large number of previously undetected malicious servers and possible zero-day attacks, with low false positives. We believe the inferred ASHs provide a better global view of the attack campaign that may not be easily captured by detecting only individual servers.

advances in social networks analysis and mining | 2014

Detecting malicious clients in ISP networks using HTTP connectivity graph and flow information

Lei Liu; Sabyasachi Saha; Ruben Torres; Jianpeng Xu; Pang Ning Tan; Antonio Nucci; Marco Mellia

This paper considers an approach to identify previously undetected malicious clients in Internet Service Provider (ISP) networks by combining flow classification with a graph-based score propagation method. Our approach represents all HTTP communications between clients and servers as a weighted, near-bipartite graph, where the nodes correspond to the IP addresses of clients and servers while the links are their interconnections, weighted according to the output of a flow-based classifier. We employ a two-phase alternating score propagation algorithm on the graph to identify suspicious clients in a monitored network. Using a symmetrized weighted adjacency matrix as its input, we show that our score propagation algorithm is less prone to inflating the malicious scores of popular Web servers with high in-degrees than the normalization used in PageRank, a widely used graph-based method. Experimental results on a 4-hour network trace collected by a large Internet service provider showed that incorporating flow information into score propagation significantly improves the precision of the algorithm.
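The abstract does not spell out the propagation rule, so the following is only a rough sketch of what a two-phase alternating propagation on a client-server graph can look like. The symmetric normalization w / sqrt(d_client * d_server), the damping factor `alpha`, the seed priors, and the function name `propagate` are all assumptions made for illustration, not the authors' algorithm.

```python
from math import sqrt


def propagate(edges, seeds, alpha=0.5, rounds=20):
    """Alternate scores between clients and servers.

    edges: {(client, server): weight}, weights assumed strictly positive.
    seeds: {client: prior maliciousness score} (hypothetical input).
    """
    # Weighted degree of every client and server.
    deg_c, deg_s = {}, {}
    for (c, s), w in edges.items():
        deg_c[c] = deg_c.get(c, 0.0) + w
        deg_s[s] = deg_s.get(s, 0.0) + w

    # Symmetric normalization: a popular server's large degree damps
    # the score it receives and passes on.
    norm = {(c, s): w / sqrt(deg_c[c] * deg_s[s]) for (c, s), w in edges.items()}

    client_score = {c: seeds.get(c, 0.0) for c in deg_c}
    server_score = {s: 0.0 for s in deg_s}
    for _ in range(rounds):
        # Phase 1: push client scores to servers.
        server_score = {s: 0.0 for s in deg_s}
        for (c, s), w in norm.items():
            server_score[s] += w * client_score[c]
        # Phase 2: push server scores back to clients, blended with the prior.
        pushed = {c: 0.0 for c in deg_c}
        for (c, s), w in norm.items():
            pushed[c] += w * server_score[s]
        client_score = {
            c: alpha * seeds.get(c, 0.0) + (1 - alpha) * pushed[c] for c in deg_c
        }
    return client_score, server_score


edges = {("c1", "s_bad"): 1.0, ("c2", "s_bad"): 1.0, ("c3", "s_good"): 1.0}
client_score, server_score = propagate(edges, seeds={"c1": 1.0})
```

In this toy graph, client c2 shares a server with the seeded client c1 and therefore ends with a higher score than c3, which only contacts an unrelated server. In the setting the abstract describes, the edge weights would come from the flow-based classifier.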

conference on information and knowledge management | 2012

Weighted linear kernel with tree transformed features for malware detection

Prakash Mandayam Comar; Lei Liu; Sabyasachi Saha; Antonio Nucci; Pang Ning Tan

Malware detection from network traffic flows is a challenging problem due to data irregularity issues such as imbalanced class distribution, noise, missing values, and heterogeneous types of features. To address these challenges, this paper presents a two-stage classification approach for malware detection. The framework initially employs random forest as a macro-level classifier to separate the malicious from non-malicious network flows, followed by a collection of one-class support vector machine classifiers to identify the specific type of malware. A novel tree-based feature construction approach is proposed to deal with data imperfection issues. As the performance of the support vector machine classifier often depends on the kernel function used to compute the similarity between every pair of data points, designing an appropriate kernel is essential for accurate identification of malware classes. We present a simple algorithm to construct a weighted linear kernel on the tree transformed features and demonstrate its effectiveness in detecting malware from real network traffic data.

communications and networking symposium | 2015

Leveraging client-side DNS failure patterns to identify malicious behaviors

Pengkui Luo; Ruben Torres; Zhi Li Zhang; Sabyasachi Saha; Sung-Ju Lee; Antonio Nucci; Marco Mellia

DNS has been increasingly abused by adversaries for cyber-attacks. Recent research has leveraged DNS failures (i.e., DNS queries that result in a Non-Existent-Domain response from the server) to identify malware activities, especially domain-flux botnets that generate many random domains as a rendezvous technique for command-&-control. Using ISP network traces, we conduct a systematic analysis of DNS failure characteristics, with the goal of uncovering how attackers exploit DNS for malicious activities. In addition to DNS failures generated by domain-flux bots, we discover many diverse and stealthy failure patterns that have received little attention. Based on these findings, we present a framework that detects diverse clusters of suspicious domain names that cause DNS failures, by considering multiple types of syntactic as well as temporal patterns. Our evolutionary learning framework evaluates the clusters produced over time to eliminate spurious cases while retaining sustaining (i.e., highly suspicious) clusters. One of the advantages of our framework is that it analyzes DNS failures on a per-client basis and does not hinge on the existence of multiple clients infected by the same malware. Our evaluation on a large ISP network trace shows that our framework detects at least 97% of the clients with suspicious DNS behaviors, with over 81% precision.
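As a small illustration of per-client analysis of failed lookups, the sketch below groups NXDOMAIN responses by client and computes one syntactic feature of a domain name. The record layout (client, queried name, response code) and the use of character entropy of the leftmost label are assumptions chosen for illustration; the abstract does not say which syntactic features the framework uses.

```python
from collections import Counter, defaultdict
from math import log2


def failures_by_client(records):
    """Group failed lookups per client.

    records: iterable of (client_ip, queried_name, rcode) tuples
    (a hypothetical log layout).
    """
    out = defaultdict(list)
    for client, name, rcode in records:
        if rcode == "NXDOMAIN":
            out[client].append(name)
    return dict(out)


def label_entropy(domain):
    """Shannon entropy (bits per character) of the leftmost DNS label."""
    label = domain.split(".")[0]
    n = len(label)
    counts = Counter(label)
    return -sum(k / n * log2(k / n) for k in counts.values())


records = [
    ("10.0.0.1", "abcd.example", "NXDOMAIN"),
    ("10.0.0.1", "www.example", "NOERROR"),
    ("10.0.0.2", "aaaa.example", "NXDOMAIN"),
]
grouped = failures_by_client(records)
```

Randomly generated names tend to have higher character entropy than repetitive ones, which is why a feature of this kind is often used as a first-pass signal; temporal patterns would need timestamps that this toy layout omits.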

IEEE Transactions on Cognitive Communications and Networking | 2015

YouLighter: A Cognitive Approach to Unveil YouTube CDN and Changes

Danilo Giordano; Stefano Traverso; Luigi Grimaudo; Marco Mellia; Elena Maria Baralis; Alok Tongaonkar; Sabyasachi Saha

YouTube relies on a massively distributed content delivery network (CDN) to stream the billions of videos in its catalog. Unfortunately, very little information about the design of this CDN is available. This, combined with the pervasiveness of YouTube, poses a big challenge for Internet service providers (ISPs), which are compelled to optimize end users' quality of experience (QoE) while having almost no visibility into or understanding of CDN decisions. This paper presents YouLighter, an unsupervised technique that builds upon cognitive methodologies to identify changes in how the YouTube CDN serves traffic. YouLighter leverages only passive measurements and clustering algorithms to group caches that appear colocated and identical into edge-nodes. This automatically unveils the YouTube edge-nodes used by the ISP customers. Next, we leverage a new metric, called Pattern Dissimilarity, that compares the clustering results obtained from two different time snapshots to pinpoint sudden changes. By running YouLighter over 10-month-long traces obtained from two ISPs in different countries, we pinpoint both sudden changes in edge-node allocation and small alterations to the cache allocation policies, which actually impair the QoE that the end users perceive.

Computer Networks | 2016

MAGMA network behavior classifier for malware traffic

Enrico Bocchi; Luigi Grimaudo; Marco Mellia; Elena Maria Baralis; Sabyasachi Saha; Stanislav Miskovic; Gaspar Modelo-Howard; Sung-Ju Lee

Malware is a major threat to security and privacy of network users. A large variety of malware is typically spread over the Internet, hiding in benign traffic. New types of malware appear every day, challenging both the research community and security companies to improve malware identification techniques. In this paper we present MAGMA, MultilAyer Graphs for MAlware detection, a novel malware behavioral classifier. Our system is based on a Big Data methodology, driven by real-world data obtained from traffic traces collected in an operational network. The methodology we propose automatically extracts patterns related to a specific input event, i.e., a seed, from the enormous amount of events the network carries. By correlating such activities over (i) time, (ii) space, and (iii) network protocols, we build a Network Connectivity Graph that captures the overall network behavior of the seed. We next extract features from the Connectivity Graph and design a supervised classifier. We run MAGMA on a large dataset collected from a commercial Internet Provider where 20,000 Internet users generated more than 330 million events. Only 42,000 are flagged as malicious by a commercial IDS, which we consider as an oracle. Using this dataset, we experimentally evaluate MAGMA accuracy and robustness to parameter settings. Results indicate that MAGMA reaches 95% accuracy, with limited false positives. Furthermore, MAGMA proves able to identify suspicious network events that the IDS ignored.

international teletraffic congress | 2015

YouLighter: An Unsupervised Methodology to Unveil YouTube CDN Changes

Danilo Giordano; Stefano Traverso; Luigi Grimaudo; Marco Mellia; Elena Maria Baralis; Alok Tongaonkar; Sabyasachi Saha

YouTube relies on a massively distributed Content Delivery Network (CDN) to stream the billions of videos in its catalogue. Unfortunately, very little information about the design of this CDN is available. This, combined with the pervasiveness of YouTube, poses a big challenge for Internet Service Providers (ISPs), which are compelled to optimize end users' Quality of Experience (QoE) while having no control over the CDN decisions. This paper presents YouLighter, an unsupervised technique to identify changes in the YouTube CDN. YouLighter leverages only passive measurements to cluster co-located identical caches into edge-nodes. This automatically unveils the structure of YouTube's CDN. Further, we propose a new metric, called Pattern Dissimilarity, that compares the clusterings obtained from two different time snapshots to pinpoint sudden changes. While several approaches allow us to compare the clustering results from the same dataset, no technique measures the similarity of clusters from different datasets. Hence, we develop a novel methodology, based on the Pattern Dissimilarity, to solve this problem. By running YouLighter over 10-month-long traces obtained from ISPs, we pinpoint both sudden changes in edge-node allocation and modifications to the cache allocation policy, which actually impair the QoE that the end users perceive.

passive and active network measurement | 2014

On Understanding User Interests through Heterogeneous Data Sources

Samamon Khemmarat; Sabyasachi Saha; Han Hee Song; Mario Baldi; Lixin Gao

User interests can be learned from multiple sources, each of them presenting only partial facets. We propose an approach to merge user information from disparate data sources to enable a more complete, enriched view of user interests. Using our approach, we show that merging different sources yields three times as many interest categories in user profiles as any single source, and that merged profiles can capture many more common interests among a group of users, which is key to group profiling.
