Dino Ienco
University of Turin
Publication
Featured research published by Dino Ienco.
ACM Transactions on Knowledge Discovery From Data | 2012
Dino Ienco; Ruggero G. Pensa; Rosa Meo
Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of a categorical attribute, since the values are not ordered. In this article, we propose a framework to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute Ai can be determined by the way in which the values of the other attributes Aj are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of Ai, a low distance is obtained. We also propose a solution to the critical problem of choosing the attributes Aj. We validate our approach by embedding our distance learning framework in a hierarchical clustering algorithm. We apply it to various real-world and synthetic datasets, both low- and high-dimensional. Experimental results show that our method is competitive with state-of-the-art categorical data clustering approaches. We also show that our approach is scalable and has a low impact on the overall computational time of a clustering task.
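The intuition in this abstract can be illustrated with a small sketch. It is not the authors' algorithm: it assumes the context attributes are given (the paper selects them automatically), and it uses a normalized Euclidean distance between conditional distributions as one plausible reading of "similarly distributed". The function name `context_distance` and the toy data are hypothetical.

```python
from collections import Counter
from math import sqrt


def context_distance(rows, target, context, v1, v2):
    """Distance between two values (v1, v2) of the categorical attribute
    `target`, computed from how the `context` attributes are distributed
    in the groups of rows having target == v1 and target == v2.

    Assumption: both v1 and v2 occur at least once in `rows`.
    """
    squared_sum = 0.0
    n_terms = 0
    for attr in context:
        domain = sorted({row[attr] for row in rows})
        distributions = []
        for value in (v1, v2):
            # Values of the context attribute within the group target == value
            group = [row[attr] for row in rows if row[target] == value]
            counts = Counter(group)
            distributions.append([counts[x] / len(group) for x in domain])
        p, q = distributions
        squared_sum += sum((a - b) ** 2 for a, b in zip(p, q))
        n_terms += len(domain)
    # Normalize by the total number of context values considered
    return sqrt(squared_sum / n_terms)


rows = [
    {"color": "red", "shape": "round"},
    {"color": "red", "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "crimson", "shape": "round"},
    {"color": "blue", "shape": "square"},
    {"color": "blue", "shape": "square"},
]
d_close = context_distance(rows, "color", ["shape"], "red", "crimson")
d_far = context_distance(rows, "color", ["shape"], "red", "blue")
```

In this toy dataset, "red" and "crimson" co-occur with the same shapes, so their distance is low, while "red" and "blue" co-occur with different shapes, so their distance is high.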
International Conference on Data Mining | 2010
Dino Ienco; Francesco Bonchi; Carlos Castillo
Microblogging is a communication paradigm in which users post bits of information (brief text updates or micro media such as photos, video or audio clips) that are visible to their communities. When a user finds a “meme” of another user interesting, she can eventually repost it, thus allowing memes to propagate virally through a social network. In this paper we introduce the meme ranking problem, as the problem of selecting which k memes (among the ones posted by their contacts) to show to users when they log into the system. The objective is to maximize the overall activity of the network, that is, the total number of reposts that occur. We deeply characterize the problem, showing that not only are exact solutions infeasible, but approximate solutions are also prohibitively expensive to adopt in an on-line setting. Therefore we devise a set of heuristics and compare them through an extensive simulation based on the real-world Yahoo! Meme social graph, with parameters learnt from real logs of meme propagations. Our experimentation demonstrates the effectiveness and feasibility of these methods.
Intelligent Data Analysis | 2009
Dino Ienco; Ruggero G. Pensa; Rosa Meo
Clustering data described by categorical attributes is a challenging task in data mining applications. Unlike numerical attributes, it is difficult to define a distance between pairs of values of the same categorical attribute, since they are not ordered. In this paper, we propose a method to learn a context-based distance for categorical attributes. The key intuition of this work is that the distance between two values of a categorical attribute Ai can be determined by the way in which the values of the other attributes Aj are distributed in the dataset objects: if they are similarly distributed in the groups of objects corresponding to the distinct values of Ai, a low distance is obtained. We also propose a solution to the critical problem of choosing the attributes Aj. We validate our approach on various real-world and synthetic datasets, by embedding our distance learning method in both a partitional and a hierarchical clustering algorithm. Experimental results show that our method is competitive with state-of-the-art categorical data clustering approaches.
Data Mining and Knowledge Discovery | 2013
Dino Ienco; Céline Robardet; Ruggero G. Pensa; Rosa Meo
Data represented with multiple features coming from heterogeneous domains are increasingly common in real-world applications. Such data represent objects of a certain type, connected to other types of data, the features, so that the overall data schema forms a star structure of inter-relationships. Co-clustering these data involves the specification of many parameters, such as the number of clusters for the object dimension and for all the feature domains. In this paper we present a novel co-clustering algorithm for heterogeneous star-structured data that is parameter-less. This means that it requires neither the number of row clusters nor the number of column clusters for the given feature spaces. Our approach optimizes Goodman–Kruskal’s τ, a measure of cross-association in contingency tables that evaluates the strength of the relationship between two categorical variables. We extend τ to evaluate co-clustering solutions and in particular we apply it in a higher-dimensional setting. We propose the algorithm CoStar, which optimizes τ by a local search approach. We assess the performance of CoStar on publicly available datasets from the textual and image domains using objective external criteria. The results show that our approach outperforms state-of-the-art methods for the co-clustering of heterogeneous data, while remaining computationally efficient.
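For readers unfamiliar with the measure named in this abstract, the following sketch computes the standard two-variable Goodman–Kruskal τ on a contingency table. It covers only the classical definition (proportional reduction in error when predicting the column variable from the row variable); the co-clustering extension and the CoStar local search are not reproduced here. The function name is hypothetical.

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau for predicting the column variable from the
    row variable, given a contingency table as a list of rows of counts.

    Assumption: at least two columns have non-zero totals, otherwise
    the marginal error is zero and tau is undefined.
    """
    n = sum(sum(row) for row in table)
    col_totals = [sum(col) for col in zip(*table)]

    # Error when guessing the column category from its marginal distribution
    err_marginal = 1.0 - sum((c / n) ** 2 for c in col_totals)

    # Error when guessing the column category knowing the row category
    explained = sum(
        (cell / n) ** 2 / (sum(row) / n)
        for row in table
        if sum(row) > 0
        for cell in row
    )
    err_conditional = 1.0 - explained

    return (err_marginal - err_conditional) / err_marginal


tau_perfect = goodman_kruskal_tau([[5, 0], [0, 5]])
tau_independent = goodman_kruskal_tau([[2, 2], [2, 2]])
```

A value of 1 means the row category fully determines the column category, and a value of 0 means knowing the row category does not help at all.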
Discovery Science | 2013
Dino Ienco; Albert Bifet; Indrė Žliobaitė; Bernhard Pfahringer
Data labeling is an expensive and time-consuming task. Choosing which labels to use is becoming increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few address the data stream setting. In this paper we propose a new active learning approach for evolving data streams based on a pre-clustering step, for selecting the most informative instances for labeling. We consider a batch incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. The clustering step makes it possible to cover the whole data space and avoids oversampling examples from only a few areas. We compare our method with state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
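The selection step described in this abstract can be sketched as follows. This is an illustration under assumptions, not the paper's method: the abstract does not specify the clustering algorithm or the selection criterion, so the sketch takes cluster assignments and per-instance uncertainty scores as given and picks instances round-robin across clusters, most uncertain first. The name `select_for_labeling` and the sample values are hypothetical.

```python
from collections import defaultdict


def select_for_labeling(cluster_ids, uncertainty, budget):
    """Pick up to `budget` instance indices from one batch, cycling over
    clusters so that no single region of the data space is oversampled.

    cluster_ids[i] is the cluster of instance i (from any clustering step);
    uncertainty[i] is a score where higher means more informative.
    """
    by_cluster = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        by_cluster[cid].append(idx)

    # One queue per cluster, most uncertain instance first
    queues = [
        sorted(members, key=lambda i: -uncertainty[i])
        for _, members in sorted(by_cluster.items())
    ]

    selected = []
    while len(selected) < budget and any(queues):
        for queue in queues:
            if queue and len(selected) < budget:
                selected.append(queue.pop(0))
    return selected


cluster_ids = [0, 0, 0, 1, 1, 2]
uncertainty = [0.9, 0.8, 0.7, 0.2, 0.3, 0.1]
chosen = select_for_labeling(cluster_ids, uncertainty, budget=3)
```

With these values, ranking by uncertainty alone would spend the whole budget on cluster 0, whereas the round-robin selection takes one instance from each cluster.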
European Conference on Machine Learning | 2009
Dino Ienco; Ruggero G. Pensa; Rosa Meo
Clustering high-dimensional data is challenging. Classic metrics fail to identify real similarities between objects. Moreover, the huge number of features makes cluster interpretation hard. To tackle these problems, several co-clustering approaches have been proposed which try to compute a partition of objects and a partition of features simultaneously. Unfortunately, these approaches identify only a predefined number of flat co-clusters. Instead, it is useful if the clusters are arranged in a hierarchical fashion, because the hierarchy provides insights on the clusters. In this paper we propose a novel hierarchical co-clustering, which builds two coupled hierarchies, one on the objects and one on the features, thus providing insights on both of them. Our approach does not require a pre-specified number of clusters, and produces compact hierarchies because it makes n-ary splits, where n is automatically determined. We validate our approach on several high-dimensional datasets against state-of-the-art competitors.
Archive | 2009
Rosa Meo; Dino Ienco
The growing complexity and volume of modern databases make it increasingly important for researchers and practitioners involved with association rule mining to make sense of the information they contain. Rare Association Rule Mining and Knowledge Discovery: Technologies for Infrequent and Critical Event Detection provides readers with an in-depth compendium of current issues, trends, and technologies in association rule mining. Covering a comprehensive range of topics, this book discusses underlying frameworks, mining techniques, interest metrics, and real-world application domains within the field.
Association rules are an intuitive descriptive paradigm that has been used extensively in recent years and in different application domains to identify regularities and correlations in a set of observed objects. Recently, however, the statistical measures of association rules (support and confidence) have been criticized because in some cases they have been shown to fail at their primary goal, which is to select the most relevant and significant association rules. In this paper we propose a new model that replaces the support measure. The new model, like support, is a tool for identifying reliable rules and is also used to reduce the traversal of the itemset search space. The proposed model adopts new criteria to establish the reliability of the information extracted from the database. These criteria are based on Bayes’ Theorem and on an estimate of the probability density function of each itemset. According to our criteria, the information obtained from the database on an itemset is reliable if and only if the confidence interval of the estimated probability is small compared with its most likely value. We will see how this method can be computed in an approximate but satisfactory way, with computational time comparable to the test on support.
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing | 2013
Elodie Vintrou; Dino Ienco; Agnès Bégué; Maguelonne Teisseire
The northern fringe of sub-Saharan Africa is a region that is considered to be particularly vulnerable to climate variability and change, and it is a location in which food security remains a major challenge. To address these issues, it is essential to develop global data sets of the geographic distribution of agricultural land use. The objectives of this study were to test an original data mining approach for classifying and mapping the cropped land in West Africa using coarse-resolution imagery and to compare the classification results with those obtained from a classic ISODATA approach. The data mining approach is able to handle large volumes of data and is based on different descriptors (65) of the land use, including the spatial and temporal satellite-derived metrics of 12 MODIS NDVI 16-day composite images and the static attributes taken from field surveys. The classic ISODATA method showed that 68.3% of pixels from a SPOT reference map were correctly classified in three validation sites versus 57.8% for the data mining approach. Validation by field observations showed equivalent results for both methods with an F-score of 0.72. The results of this study demonstrated the relevance of the use of data-mining tools for large-area monitoring.
Pattern Recognition | 2012
Rosa Meo; Dipankar Bachar; Dino Ienco
Current work on assembling a set of local patterns such as rules and class association rules into a global model for the prediction of a target usually focuses on the identification of the minimal set of patterns that cover the training data. In this paper we present a different point of view: the model of a class is built to emphasize the typical features of the examples of the class. Typical features are modeled by frequent itemsets extracted from the examples and constitute a new representation space of the examples of the class. Prediction of the target class of test examples occurs by computing the distance between the vector representing the example in the space of the itemsets of each class and the vectors representing the classes. It is interesting to observe that in the distance computation the critical contribution to the discrimination between classes is given not only by the itemsets of the class model that match the example but also by itemsets that do not match the example. First, these absent features constitute pieces of information on the examples that can be considered for the prediction and should not be disregarded. Second, absent features are more abundant in the wrong classes than in the correct ones, and their number increases the distance between the example vector and the negative class vectors. Furthermore, since absent features are frequent features in their respective classes, they make the prediction more robust against over-fitting and noise. The usage of features absent in the test example is a novel issue in classification: existing learners usually tend to select the best local pattern that matches the example and do not consider the abundance of other patterns that do not match it.
We demonstrate the validity of our observations and the effectiveness of LODE, our learner, by means of extensive empirical experiments in which we compare the prediction accuracy of LODE with a consistent set of state-of-the-art classifiers. In this paper we also report the methodology that we adopted to automatically determine the setting of the learner and of its parameters.
Data Mining and Knowledge Discovery | 2014
Ruggero G. Pensa; Dino Ienco; Rosa Meo
Clustering data is challenging for two main reasons. First, the dimensionality of the data is often very high, which makes cluster interpretation hard. Moreover, with high-dimensional data the classic metrics fail to identify the real similarities between objects. The second challenge is the evolving nature of the observed phenomena, which makes datasets accumulate over time. In this paper we propose solutions to both problems. To tackle the high-dimensionality problem, we propose to apply a co-clustering approach on the dataset that stores the occurrence of features in the observed objects. Co-clustering computes a partition of objects and a partition of features simultaneously. The novelty of our co-clustering solution is that it arranges the clusters in a hierarchical fashion, and it consists of two hierarchies: one on the objects and one on the features. The two hierarchies are coupled because the clusters at a certain level in one hierarchy are coupled with the clusters at the same level of the other hierarchy and form the co-clusters. Each cluster of one of the two hierarchies thus provides insights on the clusters of the other hierarchy. Another novelty of the proposed solution is that the number of clusters is possibly unlimited. Nevertheless, the produced hierarchies are still compact and therefore more readable because our method allows multiple splits of a cluster at the lower level. As regards the second challenge, the accumulating nature of the data makes the datasets intractably large over time. In this case, an incremental solution relieves the issue because it partitions the problem. In this paper we introduce an incremental version of our hierarchical co-clustering algorithm. It starts from an intermediate solution computed on the previous version of the data and updates the co-clustering results considering only the added block of data.
This solution has the merit of speeding up the computation with respect to the original approach, which would recompute the result on the overall dataset. In addition, the incremental algorithm guarantees approximately the same answer as the original version, while saving much of the computational load. We validate the incremental approach on several high-dimensional datasets and perform an accurate comparison with both the original version of our algorithm and state-of-the-art competitors. The obtained results open the way to a novel usage of co-clustering algorithms in which it is advantageous to partition the data into several blocks and process them incrementally, thus “incorporating” data gradually into an on-going co-clustering solution.