Panagiotis Papapetrou

Publication

Featured research published by Panagiotis Papapetrou.

International Conference on Data Mining | 2005

Discovering frequent arrangements of temporal intervals

Panagiotis Papapetrou; George Kollios; Stan Sclaroff; Dimitrios Gunopulos

In this paper we study a new problem in temporal pattern mining: discovering frequent arrangements of temporal intervals. We assume that the database consists of sequences of events, where an event occurs during a time interval. The goal is to mine arrangements of event intervals that appear frequently in the database. There are many applications where these types of patterns can be useful, including data network, scientific, and financial applications. Efficient methods to find frequent arrangements of temporal intervals using both breadth-first and depth-first search techniques are described. The performance of the proposed algorithms is evaluated and compared with other approaches on real datasets (American Sign Language streams and network data) and large synthetic datasets.

International Conference on Data Engineering | 2008

Nearest Neighbor Retrieval Using Distance-Based Hashing

Vassilis Athitsos; Michalis Potamias; Panagiotis Papapetrou; George Kollios

A method is proposed for indexing spaces with arbitrary distance measures, so as to achieve efficient approximate nearest neighbor retrieval. Hashing methods, such as locality sensitive hashing (LSH), have been successfully applied for similarity indexing in vector spaces and string spaces under the Hamming distance. The key novelty of the hashing technique proposed here is that it can be applied to spaces with arbitrary distance measures, including non-metric distance measures. First, we describe a domain-independent method for constructing a family of binary hash functions. Then, we use these functions to construct multiple multibit hash tables. We show that the LSH formalism is not applicable for analyzing the behavior of these tables as index structures. We present a novel formulation that uses statistical observations from sample data to analyze retrieval accuracy and efficiency for the proposed indexing method. Experiments on several real-world data sets demonstrate that our method produces good trade-offs between accuracy and efficiency, and significantly outperforms VP-trees, which are a well-known method for distance-based indexing.
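The abstract does not spell out how the binary hash functions are built. As a rough sketch only, the code below assumes one possible construction: project an object onto the "line" through two pivot objects using nothing but pairwise distances, then threshold the projection so that about half of a sample maps to 1. The function names (`line_projection`, `make_binary_hash`, `build_tables`, `query`), the quartile thresholds, and the parameters `k` and `l` are all illustrative assumptions, not the authors' implementation.

```python
import random

def line_projection(x, p1, p2, dist):
    """Pseudo-projection of x onto the line through pivots p1 and p2,
    computed from distances only (no vector coordinates needed)."""
    d12 = dist(p1, p2)
    return (dist(x, p1) ** 2 + d12 ** 2 - dist(x, p2) ** 2) / (2 * d12)

def make_binary_hash(sample, dist, rng):
    """Pick two random pivots and an interval [t1, t2] that captures
    roughly the middle half of the sample's projections (assumed choice)."""
    while True:
        p1, p2 = rng.sample(sample, 2)
        if dist(p1, p2) > 0:
            break
    values = sorted(line_projection(x, p1, p2, dist) for x in sample)
    t1 = values[len(values) // 4]
    t2 = values[(3 * len(values)) // 4]
    return lambda x: 1 if t1 <= line_projection(x, p1, p2, dist) <= t2 else 0

def build_tables(db, dist, k, l, seed=0):
    """Build l hash tables, each keyed by the concatenation of k binary hashes."""
    rng = random.Random(seed)
    tables = []
    for _ in range(l):
        funcs = [make_binary_hash(db, dist, rng) for _ in range(k)]
        buckets = {}
        for x in db:
            buckets.setdefault(tuple(f(x) for f in funcs), []).append(x)
        tables.append((funcs, buckets))
    return tables

def query(q, tables, dist):
    """Approximate nearest neighbor: compare q only with objects that
    share a bucket with it in at least one table."""
    candidates = set()
    for funcs, buckets in tables:
        candidates.update(buckets.get(tuple(f(q) for f in funcs), []))
    return min(candidates, key=lambda x: dist(q, x)) if candidates else None

# Toy usage: integers under absolute difference (any distance function works).
db = list(range(50))
abs_dist = lambda a, b: abs(a - b)
tables = build_tables(db, abs_dist, k=4, l=5)
nearest = query(17, tables, abs_dist)
```

Because only `dist` is ever called, the same sketch runs unchanged with a non-metric distance; the price is that a query may miss its true nearest neighbor when they never share a bucket.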

Knowledge and Information Systems | 2009

Mining frequent arrangements of temporal intervals

Panagiotis Papapetrou; George Kollios; Stan Sclaroff; Dimitrios Gunopulos

The problem of discovering frequent arrangements of temporal intervals is studied. It is assumed that the database consists of sequences of events, where an event occurs during a time interval. The goal is to mine temporal arrangements of event intervals that appear frequently in the database. The motivation of this work is the observation that in practice most events are not instantaneous but occur over a period of time, and different events may occur concurrently. Thus, there are many practical applications that require mining such temporal correlations between intervals, including the linguistic analysis of annotated data from American Sign Language as well as network and biological data. Three efficient methods to find frequent arrangements of temporal intervals are described; the first two are tree-based and use breadth-first and depth-first search to mine the set of frequent arrangements, whereas the third one is prefix-based. These methods apply efficient pruning techniques that include a set of constraints that add user-controlled focus to the mining process. Moreover, based on the extracted patterns, a standard method for mining association rules is employed that applies different interestingness measures to evaluate the significance of the discovered patterns and rules. The performance of the proposed algorithms is evaluated and compared with other approaches on real (American Sign Language annotations and network data) and large synthetic datasets.
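To make the notion of an "arrangement" concrete, here is a minimal brute-force sketch restricted to arrangements of two intervals. It is not one of the tree-based or prefix-based algorithms the abstract describes. The relation names (`match`, `follow`, `meet`, `contain`, `overlap`), the `(label, start, end)` tuple layout, and the support definition (fraction of sequences containing the arrangement at least once) are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def relation(a, b):
    """Temporal relation between intervals a and b, where a is assumed to
    start no later than b (ties broken so the longer interval comes first)."""
    (_, s1, e1), (_, s2, e2) = a, b
    if (s1, e1) == (s2, e2):
        return "match"      # identical start and end
    if e1 < s2:
        return "follow"     # b starts after a has ended
    if e1 == s2:
        return "meet"       # b starts exactly when a ends
    if e2 <= e1:
        return "contain"    # b lies inside a
    return "overlap"        # b starts inside a and ends after it

def pair_arrangements(sequence):
    """All distinct (label, relation, label) triples in one event sequence."""
    ordered = sorted(sequence, key=lambda ev: (ev[1], -ev[2]))
    return {(a[0], relation(a, b), b[0]) for a, b in combinations(ordered, 2)}

def frequent_pairs(database, min_support):
    """Arrangements of two intervals whose support reaches min_support."""
    counts = Counter()
    for seq in database:
        counts.update(pair_arrangements(seq))
    n = len(database)
    return {arr: c / n for arr, c in counts.items() if c / n >= min_support}

# Toy database of three event sequences.
database = [
    [("A", 0, 5), ("B", 3, 8)],
    [("A", 1, 4), ("B", 2, 6), ("C", 6, 9)],
    [("A", 0, 10), ("B", 2, 4)],
]
frequent = frequent_pairs(database, min_support=0.6)
```

In this toy database only "A overlaps B" occurs in at least 60% of the sequences. Enumerating all pairs is quadratic per sequence and does not extend to longer arrangements, which is where the pruning strategies mentioned in the abstract would come in.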

International Conference on Management of Data | 2008

Approximate embedding-based subsequence matching of time series

Vassilis Athitsos; Panagiotis Papapetrou; Michalis Potamias; George Kollios; Dimitrios Gunopulos

A method for approximate subsequence matching is introduced that significantly improves the efficiency of subsequence matching in large time series data sets under the dynamic time warping (DTW) distance measure. Our method is called EBSM, shorthand for Embedding-Based Subsequence Matching. The key idea is to convert subsequence matching to vector matching using an embedding. This embedding maps each database time series into a sequence of vectors, so that every step of every time series in the database is mapped to a vector. The embedding is computed by applying full dynamic time warping between reference objects and each database time series. At runtime, given a query object, an embedding of that object is computed in the same manner, by running dynamic time warping between the reference objects and the query. Comparing the embedding of the query with the database vectors is used to efficiently identify relatively few areas of interest in the database sequences. Those areas of interest are then fully explored using the exact DTW-based subsequence matching algorithm. Experiments on a large, public time series data set produce speedups of over one order of magnitude compared to brute-force search, with very small losses (< 1%) in retrieval accuracy.

ACM Transactions on Database Systems | 2011

Embedding-based subsequence matching in time-series databases

Panagiotis Papapetrou; Vassilis Athitsos; Michalis Potamias; George Kollios; Dimitrios Gunopulos

We propose an embedding-based framework for subsequence matching in time-series databases that improves the efficiency of processing subsequence matching queries under the Dynamic Time Warping (DTW) distance measure. This framework partially reduces subsequence matching to vector matching, using an embedding that maps each query sequence to a vector and each database time series into a sequence of vectors. The database embedding is computed offline, as a preprocessing step. At runtime, given a query object, an embedding of that object is computed online. Relatively few areas of interest are efficiently identified in the database sequences by comparing the embedding of the query with the database vectors. Those areas of interest are then fully explored using the exact DTW-based subsequence matching algorithm. We apply the proposed framework to define two specific methods. The first method focuses on time-series subsequence matching under unconstrained Dynamic Time Warping. The second method targets subsequence matching under constrained Dynamic Time Warping (cDTW), where warping paths are not allowed to stray too much off the diagonal. In our experiments, good trade-offs between retrieval accuracy and retrieval efficiency are obtained for both methods, and the results are competitive with respect to current state-of-the-art methods.

Data Mining and Knowledge Discovery | 2014

A statistical significance testing approach to mining the most informative set of patterns

Jefrey Lijffijt; Panagiotis Papapetrou; Kai Puolamäki

Hypothesis testing using constrained null models can be used to compute the significance of data mining results given what is already known about the data. We study the novel problem of finding the smallest set of patterns that explains most about the data in terms of a global p-value. The resulting set of patterns, such as frequent patterns or clusterings, is the smallest set that statistically explains the data. We show that the newly formulated problem is, in its general form, NP-hard and there exists no efficient algorithm with finite approximation ratio. However, we show that in a special case a solution can be computed efficiently with a provable approximation ratio. We find that a greedy algorithm gives good results on real data and that, using our approach, we can formulate and solve many known data-mining tasks. We demonstrate our method on several data mining tasks. We conclude that our framework is able to identify in various settings a small set of patterns that statistically explains the data and to formulate data mining problems in terms of statistical significance.

Pervasive Technologies Related to Assistive Environments | 2012

A survey of query-by-humming similarity methods

Alexios Kotsifakos; Panagiotis Papapetrou; Jaakko Hollmén; Dimitrios Gunopulos; Vassilis Athitsos

Performing similarity search in large databases is a problem of particular interest in many communities, such as music, database, and data mining. Although several solutions have been proposed in the literature that perform well in many application domains, there is no best method to solve this kind of problem in a Query-By-Humming (QBH) application. In QBH the goal is to find the song(s) most similar to a hummed query in an efficient manner. In this paper, we focus on providing a brief overview of the representations to encode music pieces, and also on the methods that have been proposed for QBH or other similarly defined problems.

Literary and Linguistic Computing | 2016

Significance testing of word frequencies in corpora

Jefrey Lijffijt; Terttu Nevalainen; Tanja Säily; Panagiotis Papapetrou; Kai Puolamäki; Heikki Mannila

Finding out whether a word occurs significantly more often in one text or corpus than in another is an important question in analysing corpora. As noted by Kilgarriff (Language is never, ever, ever, random, Corpus Linguistics and Linguistic Theory, 2005; 1(2): 263–76), the use of the χ² and log-likelihood ratio tests is problematic in this context, as they are based on the assumption that all samples are statistically independent of each other. However, words within a text are not independent. As pointed out in Kilgarriff (Comparing corpora, International Journal of Corpus Linguistics, 2001; 6(1): 1–37) and Paquot and Bestgen (Distinctive words in academic writing: a comparison of three statistical tests for keyword extraction. In Jucker, A., Schreier, D., and Hundt, M. (eds), Corpora: Pragmatics and Discourse. Amsterdam: Rodopi, 2009, pp. 247–69), it is possible to represent the data differently and employ other tests, such that we assume independence at the level of texts rather than individual words. This allows us to account for the distribution of words within a corpus. In this article we compare the significance estimates of various statistical tests in a controlled resampling experiment and in a practical setting, studying differences between texts produced by male and female fiction writers in the British National Corpus. We find that the choice of the test, and hence data representation, matters. We conclude that significance testing can be used to find consequential differences between corpora, but that assuming independence between all words may lead to overestimating the significance of the observed differences, especially for poorly dispersed words. We recommend the use of the t-test, Wilcoxon rank-sum test, or bootstrap test for comparing word frequencies across corpora.
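The text-level representation can be illustrated with a small sketch: each text contributes one observation (the word's relative frequency in that text), and the two corpora are compared on those observations. The code assumes Welch's unequal-variance form of the t-test, which the abstract does not specify, and it stops at the test statistic and degrees of freedom; turning them into a p-value needs a t-distribution routine from a statistics library. The function names and toy corpora are hypothetical.

```python
import math
from statistics import mean, variance

def relative_frequencies(texts, word):
    """One observation per text: the share of tokens equal to `word`.
    Each text is a non-empty list of tokens."""
    return [text.count(word) / len(text) for text in texts]

def welch_t(x, y):
    """Welch's t statistic and approximate degrees of freedom for two
    samples of per-text frequencies (each needs at least two texts)."""
    vx = variance(x) / len(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

# Toy corpora: each inner list is one tokenized text.
corpus_a = [["the", "cat", "sat"], ["the", "the", "dog", "ran"], ["a", "cat", "ran", "the"]]
corpus_b = [["a", "dog", "sat"], ["the", "dog", "ran", "far"], ["a", "cat", "sat", "down"]]

freq_a = relative_frequencies(corpus_a, "the")
freq_b = relative_frequencies(corpus_b, "the")
t_stat, dof = welch_t(freq_a, freq_b)
```

The sample size here is the number of texts, not the number of words, so a word that is frequent in only one or two texts yields high variance and a weaker statistic instead of an inflated significance estimate.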

Pervasive Technologies Related to Assistive Environments | 2010

Benchmarking dynamic time warping for music retrieval

Jefrey Lijffijt; Panagiotis Papapetrou; Jaakko Hollmén; Vassilis Athitsos

We study the performance of three dynamic programming methods on music retrieval. The methods are designed for time series matching but can be directly applied to retrieval of music. Dynamic Time Warping (DTW) identifies an optimal alignment between two time series, and computes the matching cost corresponding to that alignment. Significant speed-ups can be achieved by constrained Dynamic Time Warping (cDTW), which narrows down the set of positions in one time series that can be matched with specific positions in the other time series. Both methods are designed for full sequence matching but can also be applied for subsequence matching, by using a sliding window over each database sequence to compute a matching score for each database subsequence. In addition, SPRING is a dynamic programming approach designed for subsequence matching, where the query is matched with a database subsequence without requiring the match length to be equal to the query length. SPRING has a lower computational cost than DTW and cDTW. Our database consists of a set of MIDI files taken from the web. Each MIDI file has been converted to a 2-dimensional time series, taking into account both note pitches and durations. We have used synthetic queries of fixed size and different noise levels. Surprisingly, when looking for the top-K best matches, all three approaches show similar behavior in terms of retrieval accuracy for small values of K. This suggests that for the specific application area, a computationally cheaper method, such as SPRING, is sufficient to retrieve the best top-K matches.
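The three dynamic programming variants compared above can be sketched in a few lines. This is a simplified illustration, not the benchmarked code: it assumes 1-dimensional series with absolute difference as the local cost (the paper uses 2-dimensional pitch and duration series), models cDTW with a band of half-width `band` around the diagonal, and implements subsequence matching with a free start and end in the spirit of SPRING without its streaming machinery. All function and parameter names are illustrative.

```python
import math

def dtw(a, b, band=None):
    """DTW cost between sequences a and b.
    band=None gives unconstrained DTW; an integer restricts cell (i, j)
    to |i - j| <= band, which is the constrained variant (cDTW)."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        lo, hi = 1, m
        if band is not None:
            lo, hi = max(1, i - band), min(m, i + band)
        for j in range(lo, hi + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

def subsequence_dtw(query, series):
    """Best DTW cost between the query and any contiguous subsequence of
    series. Returns (cost, index in series where the best match ends)."""
    n, m = len(query), len(series)
    # Row 0 is all zeros, so a match may start at any position for free.
    D = [[0.0] * (m + 1)] + [[math.inf] * (m + 1) for _ in range(n)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - series[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    end = min(range(1, m + 1), key=lambda j: D[n][j])
    return D[n][end], end - 1

full_cost = dtw([1, 2, 3], [1, 2, 2, 3])
banded_cost = dtw([1, 2, 3], [1, 2, 2, 3], band=1)
best_cost, best_end = subsequence_dtw([2, 3], [9, 9, 2, 3, 9])
```

The subsequence variant fills a single table per database sequence, whereas the sliding-window use of DTW or cDTW fills one table per window, which is the source of the cost difference noted in the abstract.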

European Conference on Machine Learning | 2011

ARTEMIS: assessing the similarity of event-interval sequences

Orestis Kostakis; Panagiotis Papapetrou; Jaakko Hollmén

In several application domains, such as sign language, medicine, and sensor networks, events are not necessarily instantaneous but can have a time duration. Sequences of interval-based events may contain useful domain knowledge; thus, searching, indexing, and mining such sequences is crucial. We introduce two distance measures for comparing sequences of interval-based events which can be used for several data mining tasks such as classification and clustering. The first measure maps each sequence of interval-based events to a set of vectors that hold information about all concurrent events. These sets are then compared using an existing dynamic programming method. The second measure, called Artemis, finds correspondence between intervals by mapping the two sequences into a bipartite graph. Similarity is inferred by employing the Hungarian algorithm. In addition, we present a linear-time lower bound for Artemis. The performance of both measures is tested on data from three domains: sign language, medicine, and sensor networks. Experiments show the superiority of Artemis in terms of robustness to high levels of artificially introduced noise.
