Pauli Miettinen | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Pauli Miettinen is active.

Explore More

Publication

Featured researches published by Pauli Miettinen.

IEEE Transactions on Knowledge and Data Engineering | 2008

The Discrete Basis Problem

Pauli Miettinen; Taneli Mielikäinen; Aristides Gionis; Gautam Das; Heikki Mannila

Matrix decomposition methods represent a data matrix as a product of two factor matrices: one containing basis vectors that represent meaningful concepts in the data, and another describing how the observed data can be expressed as combinations of the basis vectors. Decomposition methods have been studied extensively, but many methods return real-valued matrices. Interpreting real-valued factor matrices is hard if the original data is Boolean. In this paper, we describe a matrix decomposition formulation for Boolean data, the Discrete Basis Problem. The problem seeks for a Boolean decomposition of a binary matrix, thus allowing the user to easily interpret the basis vectors. We also describe a variation of the problem, the Discrete Basis Partitioning Problem. We show that both problems are NP-hard. For the Discrete Basis Problem, we give a simple greedy algorithm for solving it; for the Discrete Basis Partitioning Problem we show how it can be solved using existing methods. We present experimental results for the greedy algorithm and compare it against other, well known methods. Our algorithm gives intuitive basis vectors, but its reconstruction error is usually larger than with the real-valued methods. We discuss about the reasons for this behavior.

european conference on principles of data mining and knowledge discovery | 2006

The discrete basis problem

Pauli Miettinen; Taneli Mielikäinen; Aristides Gionis; Gautam Das; Heikki Mannila

ACM Transactions on Knowledge Discovery From Data | 2014

MDL4BMF: Minimum Description Length for Boolean Matrix Factorization

Pauli Miettinen; Jilles Vreeken

Matrix factorizations—where a given data matrix is approximated by a product of two or more factor matrices—are powerful data mining tools. Among other tasks, matrix factorizations are often used to separate global structure from noise. This, however, requires solving the “model order selection problem” of determining the proper rank of the factorization, that is, to answer where fine-grained structure stops, and where noise starts. Boolean Matrix Factorization (BMF)—where data, factors, and matrix product are Boolean—has in recent years received increased attention from the data mining community. The technique has desirable properties, such as high interpretability and natural sparsity. Yet, so far no method for selecting the correct model order for BMF has been available. In this article, we propose the use of the Minimum Description Length (MDL) principle for this task. Besides solving the problem, this well-founded approach has numerous benefits; for example, it is automatic, does not require a likelihood function, is fast, and, as experiments show, is highly accurate. We formulate the description length function for BMF in general—making it applicable for any BMF algorithm. We discuss how to construct an appropriate encoding: starting from a simple and intuitive approach, we arrive at a highly efficient data-to-model--based encoding for BMF. We extend an existing algorithm for BMF to use MDL to identify the best Boolean matrix factorization, analyze the complexity of the problem, and perform an extensive experimental evaluation to study its behavior.

Statistical Analysis and Data Mining | 2012

From black and white to full color: extending redescription mining outside the Boolean world

Esther Galbrun; Pauli Miettinen

Redescription mining is a powerful data analysis tool that is used to find multiple descriptions of the same entities. Consider geographical regions as an example. They can be characterized by the fauna that inhabits them on one hand and by their meteorological conditions on the other hand. Finding such redescriptors, a task known as niche-finding, is of much importance in biology. Current redescription mining methods cannot handle other than Boolean data. This restricts the range of possible applications or makes discretization a pre-requisite, entailing a possibly harmful loss of information. In niche-finding, while the fauna can be naturally represented using a Boolean presence/absence data, the weather cannot. In this paper, we extend redescription mining to categorical and real-valued data with possibly missing values using a surprisingly simple and efficient approach. We provide extensive experimental evaluation to study the behavior of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations on randomization methods.

international conference on data mining | 2011

Boolean Tensor Factorizations

Pauli Miettinen

Tensors are multi-way generalizations of matrices, and similarly to matrices, they can also be factorized, that is, represented (approximately) as a product of factors. These factors are typically either all matrices or a mixture of matrices and tensors. With the widespread adoption of matrix factorization techniques in data mining, also tensor factorizations have started to gain attention. In this paper we study the Boolean tensor factorizations. We assume that the data is binary multi-way data, and we want to factorize it to binary factors using Boolean arithmetic (i.e. defining that 1+1=1). Boolean tensor factorizations are, therefore, natural generalization of the Boolean matrix factorizations. We will study the theory of Boolean tensor factorizations and show that at least some of the benefits Boolean matrix factorizations have over normal matrix factorizations carry over to the tensor data. We will also present algorithms for Boolean variations of CP and Tucker decompositions, the two most-common types of tensor factorizations. With experimentation done with synthetic and real-world data, we show that Boolean tensor factorizations are a viable alternative when the data is naturally binary.

siam international conference on data mining | 2011

From Black and White to Full Colour: Extending Redescription Mining Outside the Boolean World

Esther Galbrun; Pauli Miettinen

Redescription mining is a powerful data analysis tool that is used to find multiple descriptions of the same entities. Consider geographical regions as an example. They can be characterized by the fauna that inhabits them on one hand and by their meteorological conditions on the other hand. Finding such redescriptors, a task known as niche-finding, is of much importance in biology. But current redescription mining methods cannot handle other than Boolean data. This restricts the range of possible applications or makes discretization a prerequisite, entailing a possibly harmful loss of information. In niche-finding, while the fauna can be naturally represented using a Boolean presence/absence data, the weather cannot. In this paper, we extend redescription mining to real-valued data using a surprisingly simple and efficient approach. We provide extensive experimental evaluation to study the behaviour of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations on randomization methods.

international conference on data mining | 2010

Sparse Boolean Matrix Factorizations

Pauli Miettinen

Matrix factorizations are commonly used methods in data mining. When the input data is Boolean, replacing the standard matrix multiplication with Boolean matrix multiplication can yield more intuitive results. Unfortunately, finding a good Boolean decomposition is known to be computationally hard, with even many sub-problems being hard to approximate. Many real-world data sets are sparse, and it is often required that also the factor matrices are sparse. This requirement has motivated many new matrix decomposition methods and many modifications of the existing methods. This paper studies how Boolean matrix factorizations behave with sparse data: can we assume some sparsity on the factor matrices, and does the sparsity help with the computationally hard problems. The answer to these problems is shown to be positive.

Data Mining and Knowledge Discovery | 2008

The Boolean column and column-row matrix decompositions

Pauli Miettinen

Matrix decompositions are used for many data mining purposes. One of these purposes is to find a concise but interpretable representation of a given data matrix. Different decomposition formulations have been proposed for this task, many of which assume a certain property of the input data (e.g., nonnegativity) and aim at preserving that property in the decomposition. In this paper we propose new decomposition formulations for binary matrices, namely the Boolean CX and CUR decompositions. They are natural combinations of two previously presented decomposition formulations. We consider also two subproblems of these decompositions and present a rigorous theoretical study of the subproblems. We give algorithms for the decompositions and for the subproblems, and study their performance via extensive experimental evaluation. We show that even simple algorithms can give accurate and intuitive decompositions of real data, thus demonstrating the power and usefulness of the proposed decompositions.

Information Processing Letters | 2008

On the Positive--Negative Partial Set Cover problem

Pauli Miettinen

The Positive-Negative Partial Set Cover problem is introduced and its complexity, especially the hardness-of-approximation, is studied. The problem generalizes the Set Cover problem, and it naturally arises in certain data mining applications.

knowledge discovery and data mining | 2012

Siren: an interactive tool for mining and visualizing geospatial redescriptions

Esther Galbrun; Pauli Miettinen

We present SIREN, an interactive tool for mining and visualizing geospatial redescriptions. Redescription mining is a powerful data analysis tool that aims at finding alternative descriptions of the same entities. For example, in biology, an important task is to identify the bioclimatic constraints that allow some species to survive, that is, to describe geographical regions in terms of both the fauna that inhabits them and their bioclimatic conditions. Using SIREN, users can explore geospatial data of their interest by visualizing the redescriptions on a map, interactively edit, extend and filter them. To demonstrate the use of the tool, we focus on climatic niche-finding over Europe, as an example task. Yet, SIREN is by no means limited to a particular dataset or application.

Explore More