Saket Sathe | Researchain

Publication

Featured researches published by Saket Sathe.

SIGKDD Explorations | 2015

Theoretical Foundations and Algorithms for Outlier Ensembles

Charu C. Aggarwal; Saket Sathe

Ensemble analysis has recently been studied in the context of the outlier detection problem. In this paper, we investigate the theoretical underpinnings of outlier ensemble analysis. In spite of the significant differences between the classification and the outlier analysis problems, we show that the theoretical underpinnings of the two problems are actually quite similar in terms of the bias-variance trade-off. We explain the existing algorithms within this traditional framework, and clarify misconceptions about the reasoning underpinning these methods. We propose more effective variants of subsampling and feature bagging. We also discuss the impact of the combination function and the specific trade-offs of the average and maximization functions. We use these insights to propose new combination functions that are robust in many settings.
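The average and maximization combination functions mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the z-score standardization step, the function names, and the example scores are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(scores):
    """Convert one detector's raw scores to z-scores so that detectors
    with different scales become comparable (an assumed preprocessing step)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def combine(score_lists, how="average"):
    """Combine per-detector score lists point by point.

    score_lists: one list of scores per detector, all of equal length.
    how: "average" or "maximum".
    """
    normalized = [standardize(scores) for scores in score_lists]
    per_point = list(zip(*normalized))  # one tuple of z-scores per data point
    if how == "average":
        return [mean(p) for p in per_point]
    if how == "maximum":
        return [max(p) for p in per_point]
    raise ValueError(f"unknown combination function: {how}")


# Hypothetical scores from three detectors for four points
# (higher means more anomalous).
detector_scores = [
    [0.1, 0.2, 0.1, 0.9],
    [1.0, 1.2, 0.9, 5.0],
    [10.0, 30.0, 11.0, 12.0],
]
avg_scores = combine(detector_scores, "average")
max_scores = combine(detector_scores, "maximum")
```

In this toy input, the second point is flagged strongly by only one detector, so its maximum-combined score is much higher than its averaged score, while the fourth point is flagged by two detectors and ranks first under averaging.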

SIGKDD Explorations | 2014

Twitter analytics: a big data management perspective

Oshini Goonetilleke; Timos K. Sellis; Xiuzhen Zhang; Saket Sathe

With the inception of the Twitter microblogging platform in 2006, a myriad of research efforts have emerged studying different aspects of the Twittersphere. Each study exploits its own tools and mechanisms to capture, store, query and analyze Twitter data. Inevitably, platforms have been developed to replace this ad-hoc exploration with a more structured and methodological form of analysis. Another body of literature focuses on developing languages for querying Tweets. This paper addresses issues around the big data nature of Twitter and emphasizes the need for new data management and query language frameworks that address limitations of existing systems. We review existing approaches that were developed to facilitate Twitter analytics, followed by a discussion on research issues and technical challenges in developing integrated solutions.

SIAM International Conference on Data Mining | 2016

Kernelized Matrix Factorization for Collaborative Filtering

Xinyue Liu; Charu C. Aggarwal; Yu-Feng Li; Xiangnan Kong; Xinyuan Sun; Saket Sathe

Matrix factorization (MF) methods have shown great promise in collaborative filtering (CF). Conventional MF methods usually assume that the correlated data is distributed on a linear hyperplane, which is not always the case. Kernel methods are used widely in SVMs to classify linearly non-separable data, as well as in PCA to discover the non-linear embeddings of data. In this paper, we present a novel method to kernelize matrix factorization for collaborative filtering, which is equivalent to performing the low-rank matrix factorization in a possibly much higher dimensional space that is implicitly defined by the kernel function. Inspired by the success of multiple kernel learning (MKL) methods, we also explore the approach of learning multiple kernels from the rating matrix to further improve the accuracy of prediction. Since the right choice of kernel is usually unknown, our proposed multiple kernel matrix factorization method helps to select effective kernel functions from the candidates. Through extensive experiments on real-world datasets, we show that our proposed method captures the nonlinear correlations among data, which results in improved prediction accuracy compared to state-of-the-art CF models.

International Conference on Data Mining | 2016

Subspace Outlier Detection in Linear Time with Randomized Hashing

Saket Sathe; Charu C. Aggarwal

Outlier detection algorithms are often computationally intensive because of their need to score each point in the data. Even simple distance-based algorithms have quadratic complexity. High-dimensional outlier detection algorithms such as subspace methods are often even more computationally intensive because of their need to explore different subspaces of the data. In this paper, we propose an exceedingly simple subspace outlier detection algorithm, which can be implemented in a few lines of code, and whose time complexity is linear in the size of the data set and whose space requirement is constant. We show that this outlier detection algorithm is much faster than both conventional and high-dimensional algorithms and also provides more accurate results. The approach uses randomized hashing to score data points and has a neat subspace interpretation. Furthermore, the approach can be easily generalized to data streams. We present experimental results showing the effectiveness of the approach over other state-of-the-art methods.

Conference on Information and Knowledge Management | 2015

Fast Distributed Correlation Discovery Over Streaming Time-Series Data

Tian Guo; Saket Sathe; Karl Aberer

The dramatic rise of time-series data in a variety of contexts, such as social networks, mobile sensing, data centre monitoring, etc., has fuelled interest in obtaining real-time insights from such data using distributed stream processing systems. One such extremely valuable insight is the discovery of correlations in real-time from large-scale time-series data. A key challenge in discovering correlations is that the number of time-series pairs that have to be analyzed grows quadratically in the number of time-series, giving rise to a quadratic increase in both computation cost and communication cost between the cluster nodes in a distributed environment. To tackle the challenge, we propose a framework called AEGIS. AEGIS exploits well-established statistical properties to dramatically prune the number of time-series pairs that have to be evaluated for detecting interesting correlations. Our extensive experimental evaluations on real and synthetic datasets establish the efficacy of AEGIS over baselines.
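The quadratic growth in the number of time-series pairs can be made concrete with a short sketch. This is the brute-force baseline that evaluates every pair, not AEGIS itself; the use of Pearson correlation, the function names, the threshold, and the sample series are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pair_count(n):
    """Number of unordered pairs among n time series: n(n-1)/2."""
    return n * (n - 1) // 2


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / sqrt(var_x * var_y)


def correlated_pairs(series, threshold):
    """Brute force: evaluate every pair and keep those whose absolute
    correlation reaches the threshold. Cost grows with pair_count(n)."""
    return [
        (name_a, name_b)
        for (name_a, x), (name_b, y) in combinations(series.items(), 2)
        if abs(pearson(x, y)) >= threshold
    ]


# Hypothetical windows from three time series.
series = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 1, 3, 2],
}
strong = correlated_pairs(series, threshold=0.9)
```

With 1,000 series the baseline already has to evaluate `pair_count(1000)`, or 499,500, pairs per window, which is the cost that a pruning approach aims to reduce.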

SIAM International Conference on Data Mining | 2016

LODES: Local Density Meets Spectral Outlier Detection

Saket Sathe; Charu C. Aggarwal

The problem of outlier detection has been widely studied in existing literature because of its numerous applications in fraud detection, medical diagnostics, fault detection, and intrusion detection. A large category of outlier analysis algorithms have been proposed, such as proximity-based methods and local density-based methods. These methods are effective in finding outliers distributed along linear manifolds. Spectral methods, however, are particularly well suited to finding outliers when the data is distributed along manifolds of arbitrary shape. In practice, the underlying manifolds may have varying density, as a result of which a direct use of spectral methods may not be effective. In this paper, we show how to combine spectral techniques with local density-based methods in order to discover interesting outliers. We present experimental results demonstrating the effectiveness of our approach with respect to well-known competing methods.

Knowledge Discovery and Data Mining | 2017

Similarity Forests

Saket Sathe; Charu C. Aggarwal

Random forests are among the most successful methods used in data mining because of their extraordinary accuracy and effectiveness. However, their use is primarily limited to multidimensional data because they sample features from the original data set. In this paper, we propose a method for extending random forests to work with any arbitrary set of data objects, as long as similarities can be computed among the data objects. Furthermore, since it is understood that similarity computation between all O(n^2) pairs of n objects might be expensive, our method computes only a very small fraction of the O(n^2) pairwise similarities between objects to construct the forests. Our results show that the proposed similarity forest approach is very efficient and accurate on a wide variety of data sets. Therefore, this paper significantly extends the applicability of random forest methods to arbitrary data domains. Furthermore, the approach even outperforms traditional random forests on multidimensional data. We show that similarity forests are robust to the noisy similarity values that are ubiquitous in real-world applications. In many practical settings, the similarity values between objects are incompletely specified because of the difficulty in collecting such values. Similarity forests can be used in such cases with straightforward modifications.

International Conference on Management of Data | 2015

Microblogging Queries on Graph Databases: An Introspection

Oshini Goonetilleke; Saket Sathe; Timos K. Sellis; Xiuzhen Zhang

Microblogging data is growing at a rapid pace. This poses new challenges to the data management systems, such as graph databases, that are typically suitable for analyzing such data. In this paper, we share our experience on executing a wide variety of microblogging queries on two popular graph databases: Neo4j and Sparksee. Our queries are designed to be relevant to popular applications of microblogging data. The queries are executed on a large real graph data set comprising nearly 50 million nodes and 326 million edges.

Knowledge and Information Systems | 2018

Subspace histograms for outlier detection in linear time

Saket Sathe; Charu C. Aggarwal

Outlier detection algorithms are often computationally intensive because of their need to score each point in the data. Even simple distance-based algorithms have quadratic complexity. High-dimensional outlier detection algorithms such as subspace methods are often even more computationally intensive because of their need to explore different subspaces of the data. In this paper, we propose an exceedingly simple subspace outlier detection algorithm, which can be implemented in a few lines of code, and whose time complexity is linear in the size of the data set and whose space requirement is constant. We show that this outlier detection algorithm is much faster than both conventional and high-dimensional algorithms and also provides more accurate results. The approach uses randomized hashing to score data points and has a neat subspace interpretation. We provide a visual representation of this interpretability in terms of outlier sensitivity histograms. Furthermore, the approach can be easily generalized to data streams, where it provides an efficient approach to discover outliers in real time. We present experimental results showing the effectiveness of the approach over other state-of-the-art methods.

SIGKDD Explorations | 2016

Shedding Light on the Performance of Solar Panels: A Data-Driven View

Sue A. Chen; Arun Vishwanath; Saket Sathe; Shivkumar Kalyanaraman

The significant adoption of solar photovoltaic (PV) systems in both commercial and residential sectors has spurred an interest in monitoring the performance of these systems. This is facilitated by the increasing availability of regularly logged PV performance data in recent years. In this paper, we present a data-driven framework to systematically characterise the relationship between the performance of an existing PV system and various environmental factors. We demonstrate the efficacy of our proposed framework by applying it to a PV generation dataset from a building located in northern Australia. We show how, in light of limited site-specific weather information, this data set may be coupled with publicly available data to yield rich insights on the performance of the building's PV system.
