Simon Malinowski
Agrocampus Ouest
Publication
Featured research published by Simon Malinowski.
Intelligent Data Analysis | 2013
Simon Malinowski; Thomas Guyet; René Quiniou; Romain Tavenard
SAX (Symbolic Aggregate approXimation) is one of the main symbolization techniques for time series. A well-known limitation of SAX is that trends are not taken into account in the symbolization. This paper proposes 1d-SAX, a method to represent a time series as a sequence of symbols that each contain information about both the average and the trend of the series on a segment. We compare the efficiency of SAX and 1d-SAX in terms of goodness-of-fit, retrieval and classification performance for querying a time series database with an asymmetric scheme. The results show that 1d-SAX improves performance for an equal quantity of information, especially as the compression rate increases.
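The core idea is easy to sketch: each segment of a z-normalized series is summarized by its mean and its regression slope, and each of the two values is quantized against Gaussian breakpoints. Below is a minimal NumPy sketch of this scheme, not the paper's implementation; the `slope_scale` factor is an assumption standing in for the paper's segment-length-dependent slope quantization, and `gaussian_breakpoints` is a helper introduced here for illustration.

```python
import numpy as np
from scipy.stats import norm

def gaussian_breakpoints(n_symbols):
    # Quantization thresholds for equiprobable bins under a standard normal.
    return norm.ppf(np.linspace(0, 1, n_symbols + 1)[1:-1])

def one_d_sax(series, n_segments, n_avg_symbols, n_slope_symbols, slope_scale=0.5):
    # Represent each segment by a (mean symbol, slope symbol) pair.
    series = (series - series.mean()) / series.std()  # z-normalize
    avg_bp = gaussian_breakpoints(n_avg_symbols)
    slope_bp = gaussian_breakpoints(n_slope_symbols) * slope_scale  # assumed scaling
    symbols = []
    for seg in np.array_split(series, n_segments):
        slope, _ = np.polyfit(np.arange(len(seg)), seg, 1)  # per-segment trend
        symbols.append((int(np.searchsorted(avg_bp, seg.mean())),
                        int(np.searchsorted(slope_bp, slope))))
    return symbols
```

Dropping the slope symbol and keeping only the mean quantization recovers plain SAX, which is what makes the two representations directly comparable at equal quantities of information.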
Water Resources Research | 2013
Alice H. Aubert; Romain Tavenard; Rémi Emonet; Alban de Lavenne; Simon Malinowski; Thomas Guyet; René Quiniou; Jean-Marc Odobez; Philippe Merot; Chantal Gascuel-Odoux
To improve hydro-chemical modeling and forecasting, there is a need to better understand flood-induced variability in water chemistry and the processes controlling it in watersheds. In the literature, assumptions are often made, for instance, that stream chemistry reacts differently to rainfall events depending on the season; however, methods to verify such assumptions are not well developed. Often, few floods are studied at a time and chemicals are used as tracers. Grouping similar events from large multivariate datasets using principal component analysis and clustering methods helps to explain hydrological processes; however, these methods currently have some limits (definition of flood descriptors, linear assumption, for instance). Most clustering methods have been used in the context of regionalization, focusing more on mapping results than on understanding processes. In this study, we extracted flood patterns using the probabilistic Latent Dirichlet Allocation (LDA) model, its first use in hydrology, to our knowledge. The LDA method allows multivariate temporal datasets to be considered without having to define explanatory factors beforehand or select representative floods. We analyzed a multivariate dataset from a long-term observatory (Kervidy-Naizin, western France) containing data for four solutes monitored daily for 12 years: nitrate, chloride, dissolved organic carbon, and sulfate. The LDA method extracted four different patterns that were distributed by season. Each pattern can be explained by seasonal hydrological processes. Hydro-meteorological parameters help explain the processes leading to these patterns, which increases understanding of flood-induced variability in water quality. Thus, the LDA method appears useful for analyzing long-term datasets.
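To make the setup concrete, the sketch below feeds a count matrix of flood "words" to scikit-learn's LDA implementation. The event-by-word matrix here is synthetic and purely illustrative; it stands in for discretized solute behaviours per flood, not the actual Kervidy-Naizin observations.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical corpus: one row per flood event, one column per discretized
# solute-behaviour "word" (e.g. nitrate dilution, DOC flushing), with counts.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(120, 40))  # 120 floods, 40 descriptor words

lda = LatentDirichletAllocation(n_components=4, random_state=0)
event_mixtures = lda.fit_transform(counts)  # per-flood weights over 4 patterns
print(event_mixtures[0])  # how strongly flood 0 expresses each pattern
```

Each row of `event_mixtures` is a probability-like mixture over the extracted patterns, which is what allows the patterns to be read back against season and hydro-meteorological conditions.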
Intelligent Data Analysis | 2017
Arnaud Lods; Simon Malinowski; Romain Tavenard; Laurent Amsaleg
Dynamic Time Warping (DTW) is one of the best similarity measures for time series, and it has been used extensively in retrieval, classification and mining applications. It is a costly measure, however, and applying it to numerous and/or very long time series is difficult in practice. Recently, the Shapelet Transform (ST) proved to enable accurate supervised classification of time series. ST learns small subsequences that discriminate well between classes, and transforms the time series into vectors lying in a metric space. In this paper, we adopt the ST framework in a novel way: we focus on learning, without class label information, shapelets such that Euclidean distances in the ST space approximate the true DTW well. Our approach leads to a ubiquitous representation of time series in a metric space, where any machine learning method (supervised or unsupervised) and indexing system can operate efficiently.
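A minimal sketch of the two ingredients, assuming nothing beyond NumPy: a quadratic-time DTW and the shapelet transform that maps a series to its vector of minimal shapelet distances. Learning the shapelets themselves (so that Euclidean distances between transformed vectors track DTW over training pairs) is the contribution of the paper and is omitted here.

```python
import numpy as np

def dtw(a, b):
    # Plain O(len(a) * len(b)) dynamic time warping distance.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return np.sqrt(cost[n, m])

def shapelet_transform(series, shapelets):
    # Map a series to the vector of its minimal distances to each shapelet.
    vec = []
    for s in shapelets:
        L = len(s)
        vec.append(min(np.linalg.norm(series[i:i + L] - s)
                       for i in range(len(series) - L + 1)))
    return np.array(vec)
```

Once shapelets are learned, `np.linalg.norm(shapelet_transform(x, S) - shapelet_transform(y, S))` serves as a cheap stand-in for `dtw(x, y)`.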
European Conference on Machine Learning | 2016
Romain Tavenard; Simon Malinowski
In time series classification, two antagonistic notions are at stake. On the one hand, in most cases, the sooner a time series is classified, the higher the reward. On the other hand, an early classification is more likely to be erroneous. Most early classification methods are designed to take a decision as soon as a sufficient level of reliability is reached. However, in many applications, delaying the decision with no guarantee that the reliability threshold will be met in the future can be costly. Recently, a framework dedicated to optimizing a trade-off between classification accuracy and the cost of delaying the decision was proposed, together with an algorithm that decides online the optimal time instant at which to classify an incoming time series. On top of this framework, we build in this paper two early classification algorithms that optimize a trade-off between decision accuracy and the cost of delaying the decision. These algorithms are non-myopic in the sense that, even when classification is delayed, they can provide an estimate of when the optimal classification time is likely to occur. Our experiments on real datasets demonstrate that the proposed approaches are more robust than existing methods. The data and software related to this paper are available at https://github.com/rtavenar/CostAware_ECTS.
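At its simplest, the trade-off can be written as a single cost to minimize over candidate decision times. The sketch below is an offline caricature, assuming the per-timestamp error estimates are given; the paper's algorithms are interesting precisely because they estimate these quantities online, from the incomplete series, and remain non-myopic while doing so.

```python
import numpy as np

def best_decision_time(expected_error, delay_cost):
    # expected_error[t]: estimated misclassification probability at time t;
    # delay_cost: cost incurred per time step of waiting.
    t = np.arange(len(expected_error))
    total = np.asarray(expected_error) + delay_cost * t
    return int(np.argmin(total))

# Toy error curve: accuracy improves as more of the series is observed.
errors = [0.45, 0.30, 0.22, 0.18, 0.16, 0.15, 0.148]
print(best_decision_time(errors, delay_cost=0.02))  # -> optimal trade-off instant
```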
Computer Communications | 2015
Angelos K. Marnerides; Simon Malinowski; Ricardo Morla; Hyong S. Kim
The adequate operation of a number of service distribution networks relies on the effective maintenance and fault management of their underlying DSL infrastructure. Thus, new tools are required to adequately monitor, and further diagnose, anomalies that other segments of the DSL network cannot identify due to the pragmatic issues raised by hardware or software misconfigurations. In this work we present a fundamentally new approach for classifying known DSL-level anomalies by exploiting the properties of novelty detection via one-class Support Vector Machines (SVMs). Because of the imbalance in the training samples, which leads to problematic prediction outcomes when they are used in two-class formulations, we adopt one-class classification and construct models that independently identify and classify a single type of DSL-level anomaly. Given that the majority of the Digital Subscriber Line Access Multiplexers (DSLAMs) installed in the DSL network of a large European ISP were misconfigured, and thus unable to accurately flag anomalous events, we use as inference solutions the one-class SVM models built from the labels flagged by the much smaller number of correctly configured DSLAMs in the same network, in order to classify the monitored unlabeled events. By reaching averages above 95% on several classification accuracy metrics, namely precision, recall and F-score, we show that one-class SVM classifiers overcome the biased outcomes of traditional two-class formulations and can constitute viable and promising components in the design of future network fault management strategies. In addition, we demonstrate their superiority over commonly used two-class machine learning approaches, such as Decision Trees and Bayesian Networks, that have been used in the same context in past solutions.
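The one-class setup is straightforward to reproduce with scikit-learn: one model is fitted per known anomaly type, using only samples of that type. The feature vectors below are synthetic placeholders for per-DSLAM statistics, not the ISP data used in the paper.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Hypothetical training set: feature vectors of ONE known anomaly type,
# as flagged by the correctly configured DSLAMs.
rng = np.random.default_rng(1)
anomaly_train = rng.normal(loc=5.0, scale=0.5, size=(200, 4))

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(anomaly_train)

# Unlabeled monitored events: +1 = consistent with this anomaly type, -1 = not.
candidates = np.vstack([rng.normal(5.0, 0.5, (3, 4)),
                        rng.normal(0.0, 0.5, (3, 4))])
print(clf.predict(candidates))
```

Because each model only ever sees its own class, the imbalance that biases two-class training never enters the picture.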
Global Communications Conference | 2012
Angelos K. Marnerides; Simon Malinowski; Ricardo Morla; Miguel R. D. Rodrigues; Hyong S. Kim
IPTV networks blindly rely on the adequate operation and management of the underlying infrastructure, which in numerous cases is threatened by unexpected anomalous events that cause QoS degradation for the end-user. It is therefore of great importance to deploy techniques embodying diagnostic and self-protection metrics for determining and predicting the arrival of such events, in order to proactively engage defense mechanisms without the need for exhaustive manual inspection by the network operator. In this paper we propose and demonstrate the applicability of the Rényi entropy as a useful diagnostic feature for explicitly characterizing DSL-level anomalies arising in an IPTV network of a large European ISP. It is revealed that different orders of the Rényi entropy enable meaningful detection and categorization of phenomena occurring on specific Digital Subscriber Line Access Multiplexers (DSLAMs) within the DSL infrastructure. Via the joint exploitation of the local maxima generated by each Rényi-based distribution, we show that it is feasible to extract and identify lightweight anomalies that cannot be detected with simple metrics.
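The diagnostic feature itself is compact. For a distribution p over observed events, the Rényi entropy of order α is H_α(p) = log(Σ p_i^α) / (1 − α), recovering the Shannon entropy as α → 1; large α emphasizes the dominant events, small α the rare ones, which is why scanning several orders is informative. A minimal sketch, with a hypothetical event-count distribution:

```python
import numpy as np

def renyi_entropy(p, alpha):
    # H_alpha(p) = log(sum(p ** alpha)) / (1 - alpha) for alpha != 1;
    # the limit alpha -> 1 is the Shannon entropy.
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    p = p / p.sum()
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

# Hypothetical distribution of event counts observed on one DSLAM.
counts = [50, 30, 10, 5, 3, 1, 1]
for a in (0.5, 1.0, 2.0, 5.0):
    print(a, renyi_entropy(counts, a))
```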
Data Warehousing and Knowledge Discovery | 2012
Simon Malinowski; Ricardo Morla
The main paradigm for clustering evolving data streams in the last 10 years has been to divide the clustering process into an online phase, which computes and stores detailed statistics about the data in micro-clusters, and an offline phase, which queries the micro-cluster statistics and returns the desired clustering structures. The argument for two-phase algorithms is that they support evolving data streams and temporal multi-scale analysis, which single-pass algorithms do not. In this paper, we describe a single-pass, fully online, trellis-based algorithm, named ClusTrel, designed for centroid-based clustering, that supports evolving data streams and generates clustering structures right after a new point is processed. The performance of ClusTrel is assessed and compared to state-of-the-art algorithms for clustering data streams, showing similar performance with a smaller memory footprint.
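For contrast with the two-phase paradigm, here is a generic single-pass centroid-based baseline. It is emphatically not ClusTrel's trellis machinery, only an illustration of producing an up-to-date clustering right after each point, with nothing stored beyond the centroids themselves.

```python
import numpy as np

def single_pass_clustering(stream, threshold):
    # Assign each point to the nearest centroid if close enough,
    # otherwise open a new cluster; centroids are running means.
    centroids, counts = [], []
    for x in stream:
        if centroids:
            d = [np.linalg.norm(x - c) for c in centroids]
            j = int(np.argmin(d))
            if d[j] < threshold:
                counts[j] += 1
                centroids[j] += (x - centroids[j]) / counts[j]
                continue
        centroids.append(np.array(x, dtype=float))
        counts.append(1)
    return centroids

rng = np.random.default_rng(2)
stream = np.vstack([rng.normal(m, 0.1, (50, 2)) for m in (0.0, 3.0)])
print(len(single_pass_clustering(stream, threshold=1.0)))  # -> 2
```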
Similarity Search and Applications | 2018
Ricardo Carlini Sperandio; Simon Malinowski; Laurent Amsaleg; Romain Tavenard
Dynamic Time Warping (DTW) is a very popular similarity measure used for time series classification, retrieval or clustering. DTW is, however, a costly measure, and its application to numerous and/or very long time series is difficult in practice. This paper proposes a new approach for time series retrieval: time series are embedded into another space where the search procedure is less computationally demanding, while remaining accurate. This approach is based on transforming time series into high-dimensional vectors using DTW-preserving shapelets, such that the relative distances between the vectors in the Euclidean transformed space reflect well the corresponding DTW measurements in the original space. We also propose strategies for selecting a subset of shapelets in the transformed space, resulting in a trade-off between the complexity of the transformation and the accuracy of the retrieval. Experimental results on the well-known UCR time series datasets demonstrate the importance of this trade-off.
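In use, retrieval splits into an offline embedding step and a cheap online search. The sketch below assumes a `transform` in the spirit of the DTW-preserving shapelet transform sketched earlier; choosing how many shapelets to keep is exactly the complexity/accuracy trade-off the paper studies.

```python
import numpy as np

def embed_database(database, shapelets, transform):
    # Offline: compute the shapelet-transform vector of every series once.
    return np.stack([transform(ts, shapelets) for ts in database])

def retrieve(query, shapelets, transform, embedded, k=5):
    # Online: approximate DTW retrieval as Euclidean search in the ST space.
    q = transform(query, shapelets)
    d = np.linalg.norm(embedded - q, axis=1)
    return np.argsort(d)[:k]  # indices of the k nearest series
```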
European Conference on Machine Learning | 2017
Romain Tavenard; Simon Malinowski; Laetitia Chapel; Adeline Bailly; Heider Sanchez; Benjamin Bustos
In the time series classification context, the majority of the most accurate core methods are based on the Bag-of-Words framework, in which sets of local features are first extracted from the time series. A dictionary of words is then learned, and each time series is finally represented by a histogram of word occurrences. This representation induces a loss of information due to the quantization of features into words, as all the time series are represented using the same fixed dictionary. To overcome this issue, we introduce in this paper a kernel operating directly on sets of features, and then extend it to a time-compliant kernel that takes temporal information into account. We apply this kernel in the time series classification context. The proposed kernel has quadratic complexity in the size of the input feature sets, which is problematic when dealing with long time series. However, we show that kernel approximation techniques can be used to define a good trade-off between accuracy and complexity. We experimentally demonstrate that the proposed kernel can significantly improve the performance of time series classification algorithms based on Bag-of-Words.
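A match kernel of this kind can be sketched in a few lines: every pair of local features contributes an RBF term, damped by the temporal distance between the features, which is the "time-compliant" ingredient. The two bandwidth parameters below are illustrative assumptions, not the paper's settings, and the quadratic cost in the set sizes is visible in the pairwise matrices.

```python
import numpy as np

def time_compliant_match_kernel(X, Y, tX, tY, gamma_f=1.0, gamma_t=0.1):
    # X: (n, d) and Y: (m, d) sets of local features; tX, tY: their timestamps.
    ff = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # feature dists
    tt = (tX[:, None] - tY[None, :]) ** 2                       # temporal dists
    return np.mean(np.exp(-gamma_f * ff) * np.exp(-gamma_t * tt))
```

Setting `gamma_t = 0` recovers the plain kernel on feature sets; kernel approximation techniques (Nyström-type or random-feature methods, for instance) are what make such kernels tractable on long series.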
Revue Internationale de Géomatique | 2015
Thomas Guyet; Simon Malinowski; Mohand Cherif Benyounès
This article presents a method for segmenting satellite image time series (SITS) into coherent zones, that is, geographic regions with homogeneous temporal behavior. The goal of this method is, on the one hand, to extract spatio-temporal characteristics of an observed region and, on the other hand, to obtain this characterization in a computationally efficient way so as to handle large volumes of data. The method is applied to the characterization of the agro-ecological regions of Senegal through the analysis of one year of MODIS images (23 dates).
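As a loose illustration only: grouping per-pixel temporal profiles with k-means produces a map of temporally homogeneous zones, though the actual method performs a proper spatio-temporal segmentation rather than pixel-wise clustering. The data cube below is synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical SITS cube: height x width pixels, 23 dates (one year of MODIS).
rng = np.random.default_rng(3)
h, w, n_dates = 60, 80, 23
cube = rng.normal(size=(h, w, n_dates))

profiles = cube.reshape(-1, n_dates)  # one temporal profile per pixel
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(profiles)
zone_map = labels.reshape(h, w)       # map of temporally coherent zones
```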