Varun Chandola | Researchain

Publication

Featured researches published by Varun Chandola.

ACM Computing Surveys | 2009

Anomaly detection: A survey

Varun Chandola; Arindam Banerjee; Vipin Kumar

Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into different categories based on the underlying approach adopted by each technique. For each category we have identified key assumptions, which are used by the techniques to differentiate between normal and anomalous behavior. When applying a given technique to a particular domain, these assumptions can be used as guidelines to assess the effectiveness of the technique in that domain. For each category, we provide a basic anomaly detection technique, and then show how the different existing techniques in that category are variants of the basic technique. This template provides an easier and more succinct understanding of the techniques belonging to each category. Further, for each category, we identify the advantages and disadvantages of the techniques in that category. We also provide a discussion on the computational complexity of the techniques since it is an important issue in real application domains. We hope that this survey will provide a better understanding of the different directions in which research has been done on this topic, and how techniques developed in one area can be applied in domains for which they were not intended to begin with.

IEEE Transactions on Knowledge and Data Engineering | 2012

Anomaly Detection for Discrete Sequences: A Survey

Varun Chandola; Arindam Banerjee; Vipin Kumar

This survey attempts to provide a comprehensive and structured overview of the existing research for the problem of detecting anomalies in discrete/symbolic sequences. The objective is to provide a global understanding of the sequence anomaly detection problem and how existing techniques relate to each other. The key contribution of this survey is the classification of the existing research into three distinct categories, based on the problem formulation that they are trying to solve. These problem formulations are: 1) identifying anomalous sequences with respect to a database of normal sequences; 2) identifying an anomalous subsequence within a long sequence; and 3) identifying a pattern in a sequence whose frequency of occurrence is anomalous. We show how each of these problem formulations is characteristically distinct from each other and discuss their relevance in various application domains. We review techniques from many disparate and disconnected application domains that address each of these formulations. Within each problem formulation, we group techniques into categories based on the nature of the underlying algorithm. For each category, we provide a basic anomaly detection technique, and show how the existing techniques are variants of the basic technique. This approach shows how different techniques within a category are related or different from each other. Our categorization reveals new variants and combinations that have not been investigated before for anomaly detection. We also provide a discussion of relative strengths and weaknesses of different techniques. We show how techniques developed for one problem formulation can be adapted to solve a different formulation, thereby providing several novel adaptations to solve the different problem formulations. We also highlight the applicability of the techniques that handle discrete sequences to other related areas such as online anomaly detection and time series anomaly detection.
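As an illustration of the first formulation (scoring a test sequence against a database of normal sequences), here is a minimal sketch of a window-based scorer. It is one simple technique of this kind, not the survey's own algorithm; the function names, the window length `k`, and the toy sequences are assumptions made for the example.

```python
def windows(seq, k):
    """Return all contiguous length-k windows of a sequence as tuples."""
    return [tuple(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def fit_normal_windows(normal_sequences, k):
    """Collect every length-k window observed in the normal sequences."""
    seen = set()
    for seq in normal_sequences:
        seen.update(windows(seq, k))
    return seen


def anomaly_score(seq, seen, k):
    """Fraction of the sequence's windows never observed in normal data."""
    w = windows(seq, k)
    if not w:
        return 0.0  # sequence shorter than k: nothing to score
    return sum(1 for x in w if x not in seen) / len(w)


if __name__ == "__main__":
    # Hypothetical toy data: symbols are single characters.
    normal = ["abcabcabc", "bcabcabca"]
    seen = fit_normal_windows(normal, k=3)
    print(anomaly_score("abcabc", seen, k=3))  # every window is known
    print(anomaly_score("abcxbc", seen, k=3))  # windows touching 'x' are unseen
```

A higher score means more of the test sequence is made up of short patterns that never occur in the normal data. The choice of `k` trades sensitivity against noise and would need tuning per domain.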

International Conference on Data Mining | 2005

Summarization - compressing data into an informative representation

Varun Chandola; Vipin Kumar

In this paper, we formulate the problem of summarization of a data set of transactions with categorical attributes as an optimization problem involving two objective functions – compaction gain and information loss. We propose metrics to characterize the output of any summarization algorithm. We investigate two approaches to address this problem. The first approach is an adaptation of clustering and the second approach makes use of frequent itemsets from the association analysis domain. We illustrate one application of summarization in the field of network data where we show how our technique can be effectively used to summarize network traffic into a compact but meaningful representation. Specifically, we evaluate our proposed algorithms on the 1998 DARPA Off-Line Intrusion Detection Evaluation data and network data generated by SKAION Corp for the ARDA information assurance program.
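The trade-off between the two objectives can be made concrete with a small sketch. The abstract does not define either metric, so the definitions below are assumptions chosen for illustration: compaction gain is taken as the number of transactions divided by the number of summaries, and information loss as the average number of attribute values a transaction loses when replaced by its cluster's summary. The helper names and the toy network records are hypothetical.

```python
def summarize_cluster(transactions):
    """Keep only the attribute values shared by every transaction in the cluster."""
    first = transactions[0]
    return {k: v for k, v in first.items()
            if all(t.get(k) == v for t in transactions)}


def evaluate_summary(transactions, clusters):
    """Return (summaries, compaction_gain, information_loss).

    Assumed definitions (not taken from the paper):
      compaction_gain  = number of transactions / number of summaries
      information_loss = mean number of attributes dropped per transaction
    """
    summaries = [summarize_cluster([transactions[i] for i in c]) for c in clusters]
    total_loss = 0
    for cluster, summary in zip(clusters, summaries):
        for i in cluster:
            total_loss += len(transactions[i]) - len(summary)
    compaction_gain = len(transactions) / len(summaries)
    information_loss = total_loss / len(transactions)
    return summaries, compaction_gain, information_loss


if __name__ == "__main__":
    # Hypothetical network-flow records with categorical attributes.
    flows = [
        {"src": "A", "dst": "X", "port": 80},
        {"src": "A", "dst": "Y", "port": 80},
        {"src": "B", "dst": "Z", "port": 22},
        {"src": "B", "dst": "Z", "port": 22},
    ]
    summaries, gain, loss = evaluate_summary(flows, clusters=[[0, 1], [2, 3]])
    print(summaries, gain, loss)
```

Merging more transactions into each summary raises the compaction gain but tends to shrink the set of shared attribute values, which raises the information loss. That tension is what makes the problem a two-objective optimization.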

International Conference on Data Mining | 2008

Comparative Evaluation of Anomaly Detection Techniques for Sequence Data

Varun Chandola; Varun Mithal; Vipin Kumar

We present a comparative evaluation of a large number of anomaly detection techniques on a variety of publicly available as well as artificially generated data sets. Many of these are existing techniques while some are slight variants and/or adaptations of traditional anomaly detection techniques to sequence data.

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2012

Image Based Characterization of Formal and Informal Neighborhoods in an Urban Landscape

Jordan Graesser; Anil M. Cheriyadat; Ranga Raju Vatsavai; Varun Chandola; Jordan Long; Eddie A Bright

The high rate of global urbanization has resulted in a rapid increase in informal settlements, which can be defined as unplanned, unauthorized, and/or unstructured housing. Techniques for efficiently mapping these settlement boundaries can benefit various decision making bodies. From a remote sensing perspective, informal settlements share unique spatial characteristics that distinguish them from other types of structures (e.g., industrial, commercial, and formal residential). These spatial characteristics are often captured in high spatial resolution satellite imagery. We analyzed the role of spatial, structural, and contextual features (e.g., GLCM, Histogram of Oriented Gradients, Line Support Regions, Lacunarity) for urban neighborhood mapping, and computed several low-level image features at multiple scales to characterize local neighborhoods. The decision parameters to classify formal-, informal-, and non-settlement classes were learned under Decision Trees and a supervised classification framework. Experiments were conducted on high-resolution satellite imagery from the CitySphere collection, and four different cities (i.e., Caracas, Kabul, Kandahar, and La Paz) with varying spatial characteristics were represented. Overall accuracy ranged from 85% in La Paz, Bolivia, to 92% in Kandahar, Afghanistan. While the disparities between formal and informal neighborhoods varied greatly, many of the image statistics tested proved robust.

International Workshop on Analytics for Big Geospatial Data | 2012

Spatiotemporal data mining in the era of big spatial data: algorithms and applications

Ranga Raju Vatsavai; Auroop R. Ganguly; Varun Chandola; Anthony Stefanidis; Scott Klasky; Shashi Shekhar

Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from spatial and spatiotemporal data. However, explosive growth in spatial and spatiotemporal data, and the emergence of social media and location sensing technologies, emphasize the need for developing new and computationally efficient methods tailored for analyzing big data. In this paper, we review major spatial data mining algorithms by closely looking at the computational and I/O requirements and allude to a few applications dealing with big spatial data.

Statistical Analysis and Data Mining | 2011

A scalable Gaussian process analysis algorithm for biomass monitoring

Varun Chandola; Ranga Raju Vatsavai

Biomass monitoring is vital for studying the carbon cycle of Earth's ecosystem and has several significant implications, especially in the context of understanding climate change and its impacts. Recently, several change detection methods have been proposed to identify land cover changes in temporal profiles (time series) of vegetation collected using remote sensing instruments, but they do not satisfy one or both of the two requirements of the biomass monitoring problem, that is, operating in online mode and handling periodic time series. In this paper, we adapt Gaussian process (GP) regression to detect changes in such time series in an online fashion. While GPs have been widely used as a kernel-based learning method for regression and classification, their applicability to massive spatiotemporal data sets, such as remote sensing data, has been limited owing to the high computational costs involved. We focus on addressing the scalability issues associated with the proposed GP based change detection algorithm. This paper makes several significant contributions. First, we propose a GP based online time series change detection algorithm and demonstrate its effectiveness in detecting different types of changes in Normalized Difference Vegetation Index (NDVI) data obtained from a study area in Iowa, USA. Second, we propose an efficient Toeplitz matrix based solution which significantly improves the computational complexity and memory requirements of the proposed GP based method. Specifically, the proposed solution can analyze a time series of length t in O(t^2) time while maintaining an O(t) memory footprint, compared to the O(t^3) time and O(t^2) memory requirement of standard matrix manipulation based methods. Third, we describe a parallel version of the proposed solution which can be used to simultaneously analyze a large number of time series. We study three different parallel implementations: using threads, Message Passing Interface (MPI), and a hybrid implementation using threads and MPI. Experimental results show that the hybrid implementation scales better than the multithreaded and MPI based implementations. The application of the proposed scalable algorithm is demonstrated in analyzing massive remote sensing observation data. The hybrid implementation, using 1536 computing cores, can analyze an NDVI data set for the Iowa study area in nearly 5 s, while a serial algorithm, using standard Cholesky decomposition based routines, takes several days to process the same data set.
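To show where a quadratic-time, linear-memory bound can come from, here is a minimal sketch of the classical Levinson recursion for a symmetric Toeplitz system. A stationary kernel evaluated on a regularly sampled time series yields such a matrix, so only its first column needs to be stored. This is a textbook technique and is not claimed to be the paper's exact algorithm; the kernel, its `length_scale` and `noise` values, and the function names are assumptions made for the example.

```python
import math


def solve_symmetric_toeplitz(r, y):
    """Solve T x = y where T[i][j] = r[|i - j|], using Levinson recursion.

    Runs in O(t^2) time and stores only O(t) numbers; T is never formed.
    Assumes T is symmetric positive definite (e.g. a kernel matrix plus noise).
    """
    n = len(y)
    f = [1.0 / r[0]]   # forward vector: T_m f = e_1 for the leading m x m block
    x = [y[0] / r[0]]  # solution of the leading 1 x 1 system
    for m in range(2, n + 1):
        # Residual in the last row when f is padded with a trailing zero.
        eps = sum(r[m - 1 - i] * f[i] for i in range(m - 1))
        denom = 1.0 - eps * eps
        f_pad = f + [0.0]
        b_pad = [0.0] + f[::-1]  # symmetric case: backward vector = reversed forward
        f = [(f_pad[i] - eps * b_pad[i]) / denom for i in range(m)]
        b = f[::-1]              # T_m b = e_m
        # Extend the solution and correct the new last equation using b.
        eps_x = sum(r[m - 1 - i] * x[i] for i in range(m - 1))
        x = x + [0.0]
        x = [x[i] + (y[m - 1] - eps_x) * b[i] for i in range(m)]
    return x


def rbf_first_column(t, length_scale=5.0, noise=0.1):
    """First column of a squared-exponential kernel matrix on times 0..t-1."""
    col = [math.exp(-0.5 * (k / length_scale) ** 2) for k in range(t)]
    col[0] += noise  # observation noise on the diagonal
    return col


if __name__ == "__main__":
    t = 50
    r = rbf_first_column(t)
    y = [math.sin(2 * math.pi * i / 23) for i in range(t)]  # toy periodic series
    alpha = solve_symmetric_toeplitz(r, y)
    residual = max(
        abs(sum(r[abs(i - j)] * alpha[j] for j in range(t)) - y[i])
        for i in range(t)
    )
    print(residual)
```

The vector `alpha` plays the role of the GP weights (the solution of the kernel system against the observations). A Cholesky-based solver would compute the same quantity but needs the full t-by-t matrix in memory.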

International Conference on Data Mining | 2010

Using Time Series Segmentation for Deriving Vegetation Phenology Indices from MODIS NDVI Data

Varun Chandola; Dafeng Hui; Lianhong Gu; Budhendra L. Bhaduri; Ranga Raju Vatsavai

Characterizing vegetation phenology is a highly significant problem, due to its importance in regulating ecosystem carbon cycling, interacting with climate change, and informing cropland management decisions. While ground based sensors, such as the AmeriFlux sensors, can provide measurements at high temporal resolution (every hour) and can be used to accurately calculate vegetation phenology indices, they are limited to only a few sites. Remote sensing data, such as the Normalized Difference Vegetation Index (NDVI), collected using the MODerate Resolution Imaging Spectroradiometer (MODIS), can provide global coverage, though at a much coarser temporal resolution (16 days). In this study we use data mining based time series segmentation methods to derive phenology indices from NDVI data, and compare them with the phenology indices derived from the AmeriFlux data using a widely used model fitting approach. Results show a significant correlation (as high as 0.60) between the indices derived from these two different data sources. This study demonstrates that data driven methods can be effectively employed to provide realistic estimates of vegetation phenology indices using periodic time series data and have the potential to be used at large spatial scales and for long-term remote sensing data.

SIGKDD Explorations | 2008

Knowledge discovery from sensor data (SensorKDD)

Varun Chandola; Olufemi A. Omitaomu; Auroop R. Ganguly; Ranga Raju Vatsavai; Nitesh V. Chawla; João Gama; Mohamed Medhat Gaber

Extracting knowledge and emerging patterns from sensor data is a nontrivial task. The challenges for the knowledge discovery community are expected to be immense. On one hand, dynamic data streams or events require real-time analysis methodologies and systems, while on the other hand centralized processing through high end computing is also required for generating offline predictive insights, which in turn can facilitate real-time analysis. In addition, emerging societal problems require knowledge discovery solutions that are designed to investigate anomalies, changes, extremes and nonlinear processes, and departures from the normal. Keeping in view the requirements of the emerging field of knowledge discovery from sensor data, we took the initiative to develop a community of researchers with common interests and scientific goals, which culminated in the organization of the Sensor-KDD series of workshops in conjunction with the prestigious ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. In this report, we summarize the events of the Second ACM-SIGKDD International Workshop on Knowledge Discovery from Sensor Data (Sensor-KDD 2008).

IEEE Transactions on Knowledge and Data Engineering | 2012