Spiros Papadimitriou | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Spiros Papadimitriou is active.

Explore More

Publication

Featured researches published by Spiros Papadimitriou.

international conference on data engineering | 2003

LOCI: fast outlier detection using the local correlation integral

Spiros Papadimitriou; Hiroyuki Kitagawa; Phillip B. Gibbons; Christos Faloutsos

Outlier detection is an integral part of data mining and has attracted much attention recently [M. Breunig et al., (2000)], [W. Jin et al., (2001)], [E. Knorr et al., (2000)]. We propose a new method for evaluating outlierness, which we call the local correlation integral (LOCI). As with the best previous methods, LOCI is highly effective for detecting outliers and groups of outliers (a.k.a. micro-clusters). In addition, it offers the following advantages and novelties: (a) It provides an automatic, data-dictated cutoff to determine whether a point is an outlier-in contrast, previous methods force users to pick cut-offs, without any hints as to what cut-off value is best for a given dataset. (b) It can provide a LOCI plot for each point; this plot summarizes a wealth of information about the data in the vicinity of the point, determining clusters, micro-clusters, their diameters and their inter-cluster distances. None of the existing outlier-detection methods can match this feature, because they output only a single number for each point: its outlierness score, (c) Our LOCI method can be computed as quickly as the best previous methods, (d) Moreover, LOCI leads to a practically linear approximate method, aLOCI (for approximate LOCI), which provides fast highly-accurate outlier detection. To the best of our knowledge, this is the first work to use approximate computations to speed up outlier detection. Experiments on synthetic and real world data sets show that LOCI and aLOCI can automatically detect outliers and micro-clusters, without user-required cut-offs, and that they quickly spot both expected and unexpected outliers.

knowledge discovery and data mining | 2007

GraphScope: parameter-free mining of large time-evolving graphs

Jimeng Sun; Christos Faloutsos; Spiros Papadimitriou; Philip S. Yu

How can we find communities in dynamic networks of socialinteractions, such as who calls whom, who emails whom, or who sells to whom? How can we spot discontinuity time-points in such streams of graphs, in an on-line, any-time fashion? We propose GraphScope, that addresses both problems, using information theoretic principles. Contrary to the majority of earlier methods, it needs no user-defined parameters. Moreover, it is designed to operate on large graphs, in a streaming fashion. We demonstrate the efficiency and effectiveness of our GraphScope on real datasets from several diverse domains. In all cases it produces meaningful time-evolving patterns that agree with human intuition.

international conference on data mining | 2008

DisCo: Distributed Co-clustering with Map-Reduce: A Case Study towards Petabyte-Scale End-to-End Mining

Spiros Papadimitriou; Jimeng Sun

Huge datasets are becoming prevalent; even as researchers, we now routinely have to work with datasets that are up to a few terabytes in size. Interesting real-world applications produce huge volumes of messy data. The mining process involves several steps, starting from pre-processing the raw data to estimating the final models. As data become more abundant, scalable and easy-to-use tools for distributed processing are also emerging. Among those, Map-Reduce has been widely embraced by both academia and industry. In database terms, Map-Reduce is a simple yet powerful execution engine, which can be complemented with other data storage and management components, as necessary. In this paper we describe our experiences and findings in applying Map-Reduce, from raw data to final models, on an important mining task. In particular, we focus on co-clustering, which has been studied in many applications such as text mining, collaborative filtering, bio-informatics, graph mining. We propose the distributed co-clustering (DisCo) framework, which introduces practical approaches for distributed data pre-processing, and co-clustering. We develop DisCo using Hadoop, an open source Map-Reduce implementation. We show that DisCo can scale well and efficiently process and analyze extremely large datasets (up to several hundreds of gigabytes) on commodity hardware.

ACM Transactions on Knowledge Discovery From Data | 2008

Incremental tensor analysis: Theory and applications

Jimeng Sun; Dacheng Tao; Spiros Papadimitriou; Philip S. Yu; Christos Faloutsos

How do we find patterns in author-keyword associations, evolving over time? Or in data cubes (tensors), with product-branchcustomer sales information? And more generally, how to summarize high-order data cubes (tensors)? How to incrementally update these patterns over time? Matrix decompositions, like principal component analysis (PCA) and variants, are invaluable tools for mining, dimensionality reduction, feature selection, rule identification in numerous settings like streaming data, text, graphs, social networks, and many more settings. However, they have only two orders (i.e., matrices, like author and keyword in the previous example). We propose to envision such higher-order data as tensors, and tap the vast literature on the topic. However, these methods do not necessarily scale up, let alone operate on semi-infinite streams. Thus, we introduce a general framework, incremental tensor analysis (ITA), which efficiently computes a compact summary for high-order and high-dimensional data, and also reveals the hidden correlations. Three variants of ITA are presented: (1) dynamic tensor analysis (DTA); (2) streaming tensor analysis (STA); and (3) window-based tensor analysis (WTA). In paricular, we explore several fundamental design trade-offs such as space efficiency, computational cost, approximation accuracy, time dependency, and model complexity. We implement all our methods and apply them in several real settings, such as network anomaly detection, multiway latent semantic indexing on citation networks, and correlation study on sensor measurements. Our empirical studies show that the proposed methods are fast and accurate and that they find interesting patterns and outliers on the real datasets.

international conference on management of data | 2005

BRAID: stream mining through group lag correlations

Yasushi Sakurai; Spiros Papadimitriou; Christos Faloutsos

The goal is to monitor multiple numerical streams, and determine which pairs are correlated with lags, as well as the value of each such lag. Lag correlations (and anti-correlations) are frequent, and very interesting in practice: For example, a decrease in interest rates typically precedes an increase in house sales by a few months; higher amounts of fluoride in the drinking water may lead to fewer dental cavities, some years later. Additional settings include network analysis, sensor monitoring, financial data analysis, and moving object tracking. Such data streams are often correlated (or anti-correlated), but with an unknown lag.We propose BRAID, a method to detect lag correlations between data streams. BRAID can handle data streams of semi-infinite length, incrementally, quickly, and with small resource consumption. We also provide a theoretical analysis, which, based on Nyquists sampling theorem, shows that BRAID can estimate lag correlations with little, and often with no error at all. Our experiments on real and realistic data show that BRAID detects the correct lag perfectly most of the time (the largest relative error was about 1%); while it is up to 40,000 times faster than the naive implementation.

knowledge discovery and data mining | 2008

Colibri: fast mining of large static and dynamic graphs

Hanghang Tong; Spiros Papadimitriou; Jimeng Sun; Philip S. Yu; Christos Faloutsos

Low-rank approximations of the adjacency matrix of a graph are essential in finding patterns (such as communities) and detecting anomalies. Additionally, it is desirable to track the low-rank structure as the graph evolves over time, efficiently and within limited storage. Real graphs typically have thousands or millions of nodes, but are usually very sparse. However, standard decompositions such as SVD do not preserve sparsity. This has led to the development of methods such as CUR and CMD, which seek a non-orthogonal basis by sampling the columns and/or rows of the sparse matrix. However, these approaches will typically produce overcomplete bases, which wastes both space and time. In this paper we propose the family of Colibri methods to deal with these challenges. Our version for static graphs, Colibri-S, iteratively finds a non-redundant basis and we prove that it has no loss of accuracy compared to the best competitors (CUR and CMD), while achieving significant savings in space and time: on real data, Colibri-S requires much less space and is orders of magnitude faster (in proportion to the square of the number of non-redundant columns). Additionally, we propose an efficient update algorithm for dynamic, time-evolving graphs, Colibri-D. Our evaluation on a large, real network traffic dataset shows that Colibri-D is over 100 times faster than the best published competitor (CMD).

international conference on management of data | 2006

Optimal multi-scale patterns in time series streams

Spiros Papadimitriou; Philip S. Yu

We introduce a method to discover optimal local patterns, which concisely describe the main trends in a time series. Our approach examines the time series at multiple time scales (i.e., window sizes) and efficiently discovers the key patterns in each. We also introduce a criterion to select the best window sizes, which most concisely capture the key oscillatory as well as aperiodic trends. Our key insight lies in learning an optimal orthonormal transform from the data itself, as opposed to using a predetermined basis or approximating function (such as piecewise constant, short-window Fourier or wavelets), which essentially restricts us to a particular family of trends. We go one step further, lifting even that limitation. Furthermore, our method lends itself to fast, incremental estimation in a streaming setting. Experimental evaluation shows that our method can capture meaningful patterns in a variety of settings. Our streaming approach requires order of magnitude less time and space, while still producing concise and informative patterns.

international conference on data mining | 2006

Local Correlation Tracking in Time Series

Spiros Papadimitriou; Jimeng Sun; Philip S. Yu

We address the problem of capturing and tracking local correlations among time evolving time series. Our approach is based on comparing the local auto-covariance matrices (via their spectral decompositions) of each series and generalizes the notion of linear cross-correlation. In this way, it is possible to concisely capture a wide variety of local patterns or trends. Our method produces a general similarity score, which evolves over time, and accurately reflects the changing relationships. Finally, it can also be estimated incrementally, in a streaming setting. We demonstrate its usefulness, robustness and efficiency on a wide range of real datasets.

international world wide web conferences | 2001

Vinci: a service-oriented architecture for rapid development of web applications

Rakesh Agrawal; Roberto J. Bayardo; Daniel Gruhl; Spiros Papadimitriou

Vinci is a local area service-oriented architecture designed for rapid development and management of robust web applications. Based on XML document exchange, Vinci is designed to complement and interoperate with wide area service-oriented architectures such as E-Speak and .NET. This paper presents the Vinci architecture, the rationale behind its design, and an evaluation of its performance. Specically, we show how systems architected with Vinci are developed quickly, scaled eortlessly, and easily moved from prototype to production.

IEEE Transactions on Knowledge and Data Engineering | 2015

A General Geographical Probabilistic Factor Model for Point of Interest Recommendation

Bin Liu; Hui Xiong; Spiros Papadimitriou; Yanjie Fu; Zijun Yao

The problem of point of interest (POI) recommendation is to provide personalized recommendations of places, such as restaurants and movie theaters. The increasing prevalence of mobile devices and of location based social networks (LBSNs) poses significant new opportunities as well as challenges, which we address. The decision process for a user to choose a POI is complex and can be influenced by numerous factors, such as personal preferences, geographical considerations, and user mobility behaviors. This is further complicated by the connection LBSNs and mobile devices. While there are some studies on POI recommendations, they lack an integrated analysis of the joint effect of multiple factors. Meanwhile, although latent factor models have been proved effective and are thus widely used for recommendations, adopting them to POI recommendations requires delicate consideration of the unique characteristics of LBSNs. To this end, in this paper, we propose a general geographical probabilistic factor model (Geo-PFM) framework which strategically takes various factors into consideration. Specifically, this framework allows to capture the geographical influences on a users check-in behavior. Also, user mobility behaviors can be effectively leveraged in the recommendation model. Moreover, based our Geo-PFM framework, we further develop a Poisson Geo-PFM which provides a more rigorous probabilistic generative process for the entire model and is effective in modeling the skewed user check-in count data as implicit feedback for better POI recommendations. Finally, extensive experimental results on three real-world LBSN datasets (which differ in terms of user mobility, POI geographical distribution, implicit response data skewness, and user-POI observation sparsity), show that the proposed recommendation methods outperform state-of-the-art latent factor models by a significant margin.

Explore More