Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Evangelos E. Papalexakis is active.

Publication


Featured research published by Evangelos E. Papalexakis.


European Conference on Machine Learning | 2012

ParCube: sparse parallelizable tensor decompositions

Evangelos E. Papalexakis; Christos Faloutsos; Nicholas D. Sidiropoulos

How can we efficiently decompose a tensor into sparse factors when the data does not fit in memory? Tensor decompositions have gained a steadily increasing popularity in data mining applications; however, current state-of-the-art decomposition algorithms operate in main memory and do not scale to truly large datasets. In this work, we propose ParCube, a new and highly parallelizable method for speeding up tensor decompositions that is well-suited to producing sparse approximations. Experiments with even moderately large data indicate over 90% sparser outputs and 14 times faster execution, with approximation error close to the current state of the art, irrespective of computation and memory requirements. We provide theoretical guarantees for the algorithm's correctness and we experimentally validate our claims through extensive experiments, including four different real-world datasets (Enron, Lbnl, Facebook and Nell), demonstrating its effectiveness for data mining practitioners. In particular, we are the first to analyze the very large Nell dataset using a sparse tensor decomposition, demonstrating that ParCube enables us to handle very large datasets effectively and efficiently.
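As a rough illustration of the core idea, the sketch below (Python/NumPy, not the authors' implementation) bias-samples each mode by marginal mass to form a reduced sub-tensor; each such replica can then be decomposed independently with any off-the-shelf CP routine and the resulting small factors scattered back to the original (mostly zero, hence sparse) index space. The sampling fractions, tensor sizes, and single-replica setup are illustrative assumptions.

```python
import numpy as np

def parcube_sample(T, fractions=(0.5, 0.5, 0.5), seed=0):
    """Bias-sample a sub-tensor of a dense 3-way array T (ParCube-style sketch).

    Each mode is sampled with probability proportional to its marginal
    absolute mass, so dense "important" slices are more likely to be kept.
    Returns the sampled index sets and the reduced sub-tensor.
    """
    rng = np.random.default_rng(seed)
    idx = []
    for mode, frac in enumerate(fractions):
        other = tuple(m for m in range(T.ndim) if m != mode)
        marginal = np.abs(T).sum(axis=other)          # mass of each slice
        n_keep = max(1, int(frac * T.shape[mode]))
        probs = marginal / marginal.sum()
        keep = rng.choice(T.shape[mode], size=n_keep, replace=False, p=probs)
        idx.append(np.sort(keep))
    sub = T[np.ix_(*idx)]                             # reduced replica
    return idx, sub

# Toy usage: each replica would be decomposed independently (in parallel),
# and its factors scattered back to rows idx[m] of zero-padded full-size
# factor matrices, which is what makes the final factors sparse.
T = np.abs(np.random.default_rng(1).standard_normal((40, 30, 20)))
idx, sub = parcube_sample(T, fractions=(0.4, 0.4, 0.4))
print(sub.shape)
```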


IEEE Transactions on Signal Processing | 2013

From K-Means to Higher-Way Co-Clustering: Multilinear Decomposition With Sparse Latent Factors

Evangelos E. Papalexakis; Nicholas D. Sidiropoulos; Rasmus Bro

Co-clustering is a generalization of unsupervised clustering that has recently drawn renewed attention, driven by emerging data mining applications in diverse areas. Whereas clustering groups entire columns of a data matrix, co-clustering groups columns over select rows only, i.e., it simultaneously groups rows and columns. The concept generalizes to data “boxes” and higher-way tensors, for simultaneous grouping along multiple modes. Various co-clustering formulations have been proposed, but no workhorse analogous to K-means has emerged. This paper starts from K-means and shows how co-clustering can be formulated as a constrained multilinear decomposition with sparse latent factors. For three- and higher-way data, uniqueness of the multilinear decomposition implies that, unlike matrix co-clustering, it is possible to unravel a large number of possibly overlapping co-clusters. A basic multi-way co-clustering algorithm is proposed that exploits multilinearity using Lasso-type coordinate updates. Various line search schemes are then introduced to speed up convergence, and suitable modifications are proposed to deal with missing values. The imposition of latent sparsity pays a collateral dividend: it turns out that sequentially extracting one co-cluster at a time is almost optimal, hence the approach scales well for large datasets. The resulting algorithms are benchmarked against the state of the art in pertinent simulations and applied to measured data, including the ENRON e-mail corpus.
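To make the "Lasso-type coordinate updates" concrete for the matrix case, here is a minimal sketch (an illustration under assumed data and penalty values, not the paper's algorithm) that extracts a single sparse rank-one co-cluster by alternating closed-form soft-thresholded least-squares updates; further co-clusters would be extracted one at a time after deflation, echoing the sequential extraction the abstract describes.

```python
import numpy as np

def soft(x, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def sparse_cocluster(X, lam=1.0, n_iter=100, seed=0):
    """Fit one sparse rank-1 term a b^T to X by alternating Lasso-type updates.

    Minimizes ||X - a b^T||_F^2 + lam * (||a||_1 + ||b||_1); the nonzero
    supports of a and b mark the rows and columns of one co-cluster.
    """
    rng = np.random.default_rng(seed)
    a = rng.standard_normal(X.shape[0])
    b = rng.standard_normal(X.shape[1])
    for _ in range(n_iter):
        # Closed-form coordinate update for a with b fixed, and vice versa.
        a = soft(X @ b, lam / 2) / (b @ b + 1e-12)
        b = soft(X.T @ a, lam / 2) / (a @ a + 1e-12)
    return a, b

# Toy usage: a planted block of large values should be recovered as the
# dominant co-cluster (rows ~5-9, columns ~3-7).
rng = np.random.default_rng(1)
X = 0.1 * rng.standard_normal((30, 20))
X[5:10, 3:8] += 3.0                      # planted co-cluster
a, b = sparse_cocluster(X, lam=1.0)
print(np.nonzero(np.abs(a) > 1e-6)[0], np.nonzero(np.abs(b) > 1e-6)[0])
# Deflating X by the fitted a b^T and repeating extracts further co-clusters.
```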


SIAM International Conference on Data Mining | 2014

FlexiFaCT: Scalable Flexible Factorization of Coupled Tensors on Hadoop

Alex Beutel; Partha Pratim Talukdar; Abhimanu Kumar; Christos Faloutsos; Evangelos E. Papalexakis; Eric P. Xing

Given multiple data sets of relational data that share a number of dimensions, how can we efficiently decompose our data into the latent factors? Factorization of a single matrix or tensor has attracted much attention, as, e.g., in the Netflix challenge, with users rating movies. However, we often have additional side information, e.g., demographic data about the users in the Netflix example above. Incorporating the additional information leads to the coupled factorization problem. So far, it has been solved for relatively small datasets. We provide a distributed, scalable method for decomposing matrices, tensors, and coupled data sets through stochastic gradient descent on a variety of objective functions. We offer the following contributions: (1) Versatility: our algorithm can perform matrix, tensor, and coupled factorization, with flexible objective functions including the Frobenius norm, the Frobenius norm with ℓ1-induced sparsity, and non-negative factorization. (2) Scalability: FlexiFaCT scales to unprecedented sizes in both the data and the model, with up to billions of parameters, and runs on standard Hadoop. (3) Convergence: we give proofs showing that FlexiFaCT converges on the variety of objective functions, even with projections.
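The sketch below (Python/NumPy, with made-up sizes and learning rates; an illustration rather than the Hadoop implementation) shows the kind of per-observation stochastic gradient update FlexiFaCT distributes: a 3-way tensor X and a side matrix Y share the factor of their common mode, and each sampled entry updates only the factor rows it touches.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K, M, R = 50, 40, 30, 20, 5            # assumed sizes and rank
A, B, C, D = (0.1 * rng.standard_normal((n, R)) for n in (I, J, K, M))
X = rng.random((I, J, K))                     # 3-way tensor (e.g., user x item x time)
Y = rng.random((I, M))                        # coupled side matrix, shares the I mode

lr, reg = 0.01, 0.01
for epoch in range(20):
    for _ in range(2000):                     # sampled observations
        if rng.random() < 0.5:                # a tensor entry (i, j, k)
            i, j, k = rng.integers(I), rng.integers(J), rng.integers(K)
            ai, bj, ck = A[i].copy(), B[j].copy(), C[k].copy()
            err = X[i, j, k] - np.sum(ai * bj * ck)
            A[i] += lr * (err * bj * ck - reg * ai)
            B[j] += lr * (err * ai * ck - reg * bj)
            C[k] += lr * (err * ai * bj - reg * ck)
        else:                                 # a coupled matrix entry (i, m)
            i, m = rng.integers(I), rng.integers(M)
            ai, dm = A[i].copy(), D[m].copy()
            err = Y[i, m] - ai @ dm
            A[i] += lr * (err * dm - reg * ai)
            D[m] += lr * (err * ai - reg * dm)
# A is now shaped by both data sets. FlexiFaCT partitions such updates into
# blocks so that parallel workers never touch the same factor rows at once.
```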


ACM Transactions on Intelligent Systems and Technology | 2017

Tensors for Data Mining and Data Fusion: Models, Applications, and Scalable Algorithms

Evangelos E. Papalexakis; Christos Faloutsos; Nicholas D. Sidiropoulos

Tensors and tensor decompositions are very powerful and versatile tools that can model a wide variety of heterogeneous, multiaspect data. As a result, tensor decompositions, which extract useful latent information out of multiaspect data tensors, have witnessed increasing popularity and adoption by the data mining community. In this survey, we present some of the most widely used tensor decompositions, providing the key insights behind them, and summarizing them from a practitioner’s point of view. We then provide an overview of a very broad spectrum of applications where tensors have been instrumental in achieving state-of-the-art performance, ranging from social network analysis to brain data analysis, and from web mining to healthcare. Subsequently, we present recent algorithmic advances in scaling tensor decompositions up to today’s big data, outlining the existing systems and summarizing the key ideas behind them. Finally, we conclude with a list of challenges and open problems that outline exciting future research directions.
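As a concrete reference point for the workhorse decomposition such surveys cover, here is a minimal CP (PARAFAC) alternating-least-squares sketch for a dense 3-way NumPy array; it is an unoptimized illustration under assumed shapes, not the survey's recommended implementation.

```python
import numpy as np

def khatri_rao(U, V):
    """Column-wise Khatri-Rao product of U (J x R) and V (K x R) -> (J*K x R)."""
    return np.einsum('jr,kr->jkr', U, V).reshape(-1, U.shape[1])

def cp_als(T, rank, n_iter=100, seed=0):
    """Rank-`rank` CP decomposition of a dense 3-way tensor via ALS (sketch)."""
    rng = np.random.default_rng(seed)
    I, J, K = T.shape
    A = rng.standard_normal((I, rank))
    B = rng.standard_normal((J, rank))
    C = rng.standard_normal((K, rank))
    # Mode unfoldings consistent with the Khatri-Rao ordering above.
    T1 = T.reshape(I, J * K)
    T2 = np.moveaxis(T, 1, 0).reshape(J, I * K)
    T3 = np.moveaxis(T, 2, 0).reshape(K, I * J)
    for _ in range(n_iter):
        A = T1 @ np.linalg.pinv(khatri_rao(B, C).T)
        B = T2 @ np.linalg.pinv(khatri_rao(A, C).T)
        C = T3 @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

# Toy check: fit a random rank-3 tensor and print the relative error
# (typically close to zero for an exactly low-rank input).
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((10, 3)), rng.random((8, 3)), rng.random((6, 3))
T = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(T, rank=3)
print(np.linalg.norm(T - np.einsum('ir,jr,kr->ijk', A, B, C)) / np.linalg.norm(T))
```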


IEEE Transactions on Knowledge and Data Engineering | 2014

HEigen: Spectral Analysis for Billion-Scale Graphs

U Kang; Brendan Meeder; Evangelos E. Papalexakis; Christos Faloutsos

Given a graph with billions of nodes and edges, how can we find patterns and anomalies? Are there nodes that participate in too many or too few triangles? Are there close-knit near-cliques? These questions are expensive to answer unless we have the first several eigenvalues and eigenvectors of the graph adjacency matrix. However, eigensolvers suffer from subtle problems (e.g., convergence) for large sparse matrices, let alone for billion-scale ones. We address this problem with the proposed HEIGEN algorithm, which we carefully design to be accurate, efficient, and able to run on the highly scalable MAPREDUCE (HADOOP) environment. This enables HEIGEN to handle matrices more than 1,000× larger than those which can be analyzed by existing algorithms. We implement HEIGEN and run it on the M45 cluster, one of the top 50 supercomputers in the world. We report important discoveries about near-cliques and triangles on several real-world graphs, including a snapshot of the Twitter social network (56 Gb, 2 billion edges) and the “YahooWeb” data set, one of the largest publicly available graphs (120 Gb, 1.4 billion nodes, 6.6 billion edges).
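The heavy primitive such a solver distributes is the repeated sparse matrix-vector product. The sketch below shows it in plain NumPy with an edge-list representation (mirroring how an adjacency matrix is typically stored in MapReduce), using simple power iteration for the top eigenpair as a stand-in for the Lanczos-type iteration the paper actually scales up; the toy graph is an assumption for illustration.

```python
import numpy as np

def edge_matvec(edges, v, n):
    """y = A @ v for a symmetric adjacency matrix given as an undirected edge list."""
    y = np.zeros(n)
    for i, j in edges:
        y[i] += v[j]
        y[j] += v[i]
    return y

def top_eigenpair(edges, n, n_iter=100, seed=0):
    """Power iteration for the leading eigenvalue/eigenvector of the adjacency matrix."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(n)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = edge_matvec(edges, v, n)      # in a distributed setting, one MapReduce pass
        v = w / np.linalg.norm(w)
    lam = v @ edge_matvec(edges, v, n)    # Rayleigh quotient
    return lam, v

# Toy graph: a triangle attached to a short path.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4)]
lam, v = top_eigenpair(edges, n=5)
print(round(lam, 3))
```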


Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2014

Com2: Fast automatic discovery of temporal ('comet') communities

Miguel Araújo; Spiros Papadimitriou; Stephan Günnemann; Christos Faloutsos; Prithwish Basu; Ananthram Swami; Evangelos E. Papalexakis; Danai Koutra

Given a large network, changing over time, how can we find patterns and anomalies? We propose Com2, a novel and fast incremental tensor analysis approach that can discover both transient and periodic/repeating communities. The method is (a) scalable, being linear in the input size, (b) general, (c) free of user-defined parameters, and (d) effective, returning results that agree with intuition.


IEEE Signal Processing Magazine | 2014

Parallel Randomly Compressed Cubes: A scalable distributed architecture for big tensor decomposition

Nicholas D. Sidiropoulos; Evangelos E. Papalexakis; Christos Faloutsos

This article combines a tutorial on state-of-the-art tensor decomposition as it relates to big data analytics, with original research on parallel and distributed computation of low-rank decomposition for big tensors, and a concise primer on Hadoop/MapReduce. A novel architecture for parallel and distributed computation of low-rank tensor decomposition that is especially well suited for big tensors is proposed. The new architecture is based on parallel processing of a set of randomly compressed, reduced-size replicas of the big tensor. Each replica is independently decomposed, and the results are joined via a master linear equation per tensor mode. The approach enables massive parallelism with guaranteed identifiability properties: if the big tensor is of low rank and the system parameters are appropriately chosen, then the rank-one factors of the big tensor will indeed be recovered from the analysis of the reduced-size replicas. Furthermore, the architecture affords memory/storage and complexity gains that grow with the size of the big tensor relative to its rank F. No sparsity is required in the tensor or the underlying latent factors, although such sparsity can be exploited to improve memory, storage, and computational savings.
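The sketch below (NumPy, with made-up sizes) illustrates the first stage of this architecture: each replica is a mode-wise random compression of the big tensor. Every replica would then be decomposed independently, and the small factors stitched back to full size by solving a linear system per mode; that joining step is omitted here.

```python
import numpy as np

def compress_replicas(X, dims, n_replicas=3, seed=0):
    """Create randomly compressed replicas of a 3-way tensor X (sketch).

    Each replica is X multiplied along every mode by a random compression
    matrix (I x L1, J x L2, K x L3), yielding a small L1 x L2 x L3 cube.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    L1, L2, L3 = dims
    replicas = []
    for _ in range(n_replicas):
        U = rng.standard_normal((I, L1))
        V = rng.standard_normal((J, L2))
        W = rng.standard_normal((K, L3))
        # Mode-wise compression: Y = X x_1 U^T x_2 V^T x_3 W^T.
        Y = np.einsum('ijk,ia,jb,kc->abc', X, U, V, W, optimize=True)
        replicas.append((Y, U, V, W))
    return replicas

# Toy usage: a 100 x 80 x 60 tensor shrinks to several 10 x 10 x 10 cubes that
# can be decomposed in parallel; the (U, V, W) matrices are kept so the
# full-size factors can later be recovered via one linear system per mode.
X = np.random.default_rng(1).standard_normal((100, 80, 60))
replicas = compress_replicas(X, dims=(10, 10, 10))
print(replicas[0][0].shape)
```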


BMC Bioinformatics | 2014

Structure-revealing data fusion.

Evrim Acar; Evangelos E. Papalexakis; Gözde Gürdeniz; Morten Rasmussen; Anders J. Lawaetz; Mathias Nilsson; Rasmus Bro

Background: Analysis of data from multiple sources has the potential to enhance knowledge discovery by capturing underlying structures, which are, otherwise, difficult to extract. Fusing data from multiple sources has already proved useful in many applications in social network analysis, signal processing and bioinformatics. However, data fusion is challenging since data from multiple sources are often (i) heterogeneous (i.e., in the form of higher-order tensors and matrices), (ii) incomplete, and (iii) have both shared and unshared components. In order to address these challenges, in this paper, we introduce a novel unsupervised data fusion model based on joint factorization of matrices and higher-order tensors.

Results: While the traditional formulation of coupled matrix and tensor factorizations modeling only shared factors fails to capture the underlying structures in the presence of both shared and unshared factors, the proposed data fusion model has the potential to automatically reveal shared and unshared components through modeling constraints. Using numerical experiments, we demonstrate the effectiveness of the proposed approach in terms of identifying shared and unshared components. Furthermore, we measure a set of mixtures with known chemical composition using both LC-MS (Liquid Chromatography - Mass Spectrometry) and NMR (Nuclear Magnetic Resonance) and demonstrate that the structure-revealing data fusion model can (i) successfully capture the chemicals in the mixtures and extract the relative concentrations of the chemicals accurately, (ii) provide promising results in terms of identifying shared and unshared chemicals, and (iii) reveal the relevant patterns in LC-MS by coupling with the diffusion NMR data.

Conclusions: We have proposed a structure-revealing data fusion model that can jointly analyze heterogeneous, incomplete data sets with shared and unshared components and demonstrated its promising performance as well as potential limitations on both simulated and real data.
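A minimal way to see how modeling constraints can separate shared from unshared components is the objective sketch below (NumPy, with assumed shapes; a simplified rendering of the idea, not the paper's code): a tensor and a matrix share the factor of their first mode, each rank-one component carries a per-data-set weight, and an ℓ1 penalty on the weights pushes components a data set does not need toward zero weight, marking them as unshared.

```python
import numpy as np

def coupled_objective(X, Y, A, B, C, V, lam, sig, beta=0.1):
    """Sparse-weight coupled matrix-tensor factorization objective (sketch).

    X ~ sum_r lam[r] * a_r o b_r o c_r   (3-way tensor, I x J x K)
    Y ~ A @ diag(sig) @ V.T              (matrix, I x M, shares A)
    beta * (||lam||_1 + ||sig||_1) encourages unused weights to become zero,
    so a component with lam[r] != 0 and sig[r] == 0 is unshared (tensor-only).
    """
    X_hat = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)
    Y_hat = A @ np.diag(sig) @ V.T
    fit = np.linalg.norm(X - X_hat) ** 2 + np.linalg.norm(Y - Y_hat) ** 2
    return fit + beta * (np.abs(lam).sum() + np.abs(sig).sum())

# Toy usage with assumed sizes: component 2 is present only in the tensor.
rng = np.random.default_rng(0)
I, J, K, M, R = 20, 15, 10, 12, 3
A, B, C, V = (rng.random((n, R)) for n in (I, J, K, M))
lam = np.array([1.0, 1.0, 1.0])
sig = np.array([1.0, 1.0, 0.0])          # zero weight -> unshared in Y
X = np.einsum('r,ir,jr,kr->ijk', lam, A, B, C)
Y = A @ np.diag(sig) @ V.T
print(coupled_objective(X, Y, A, B, C, V, lam, sig))  # fit term is ~0 here
```

Minimizing such an objective (e.g., with a gradient-based solver) over the factors and weights is what drives the automatic shared/unshared labeling described in the abstract.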


International Conference on Acoustics, Speech, and Signal Processing | 2011

Co-clustering as multilinear decomposition with sparse latent factors

Evangelos E. Papalexakis; Nicholas D. Sidiropoulos

The K-means clustering problem seeks to partition the columns of a data matrix into subsets, such that columns in the same subset are ‘close’ to each other. The co-clustering problem seeks to simultaneously partition the rows and columns of a matrix to produce ‘coherent’ groups called co-clusters. Co-clustering has recently found numerous applications in diverse areas. The concept readily generalizes to higher-way data sets (e.g., adding a temporal dimension). Starting from K-means, we show how co-clustering can be formulated as constrained multilinear decomposition with sparse latent factors. In the case of three- and higher-way data, this corresponds to a PARAFAC decomposition with sparse latent factors. This is important, for PARAFAC is unique under mild conditions - and sparsity further improves identifiability. This allows us to uniquely unravel a large number of possibly overlapping co-clusters that are hidden in the data. Interestingly, the imposition of latent sparsity pays a collateral dividend: as one increases the number of fitted co-clusters, new co-clusters are added without affecting those previously extracted. An important corollary is that co-clusters can be extracted incrementally; this implies that the algorithm scales well for large datasets. We demonstrate the validity of our approach using the ENRON corpus, as well as synthetic data.


Advances in Social Networks Analysis and Mining | 2013

Spatio-temporal mining of software adoption & penetration

Evangelos E. Papalexakis; Tudor Dumitras; Duen Horng Chau; B. Aditya Prakash; Christos Faloutsos

How does malware propagate? Does it form spikes over time? Does it resemble the propagation pattern of benign files, such as software patches? Does it spread uniformly over countries? How long does it take for a URL that distributes malware to be detected and shut down? In this work, we answer these questions by analyzing patterns from 22 million malicious (and benign) files, found on 1.6 million hosts worldwide during the month of June 2011. We conduct this study using the WINE database available at Symantec Research Labs. Additionally, we explore the research questions raised by sampling on such large databases of executables; the importance of studying the implications of sampling is twofold: first, sampling is a means of reducing the size of the database, making it more accessible to researchers; second, every such data collection can itself be perceived as a sample of the real world. Finally, we discover the SHARKFIN temporal propagation pattern of executable files, the GEOSPLIT pattern in the geographical spread of machines that report executables to Symantec's servers, the Periodic Power Law (PPL) distribution of the life-time of URLs, and we show how to efficiently extrapolate crucial properties of the data from a small sample. To the best of our knowledge, our work represents the largest study of propagation patterns of executables.

Collaboration


Dive into Evangelos E. Papalexakis's collaborations.

Top Co-Authors

Tom M. Mitchell
Carnegie Mellon University

Ekta Gujral
University of California

Rakesh Agrawal
Association for Computing Machinery