George Giannakopoulos
University of Trento
Publications
Featured research published by George Giannakopoulos.
Web Intelligence, Mining and Semantics | 2012
George Giannakopoulos; Petra Mavridi; Georgios Paliouras; George Papadakis; Konstantinos Tserpes
Text classification constitutes a popular task in Web research, with applications that range from spam filtering to sentiment analysis. To address it, patterns of co-occurring words or characters are typically extracted from the textual content of Web documents. However, not all documents are of the same quality; for example, the curated content of news articles usually entails lower levels of noise than the user-generated content of blog posts and other Social Media. In this paper, we provide some insight and a preliminary study on a tripartite categorization of Web documents, based on inherent document characteristics. We claim that each category calls for different classification settings with respect to the representation model, and we verify this claim experimentally, by showing that topic classification on these different document types yields very different results per type. In addition, we consider a novel approach that improves the performance of topic classification across all types of Web documents: n-gram graphs. This model goes beyond the established bag-of-words one, representing each document as a graph. Individual graphs can be combined into a class graph, and graph similarities are then employed to position and classify documents in the vector space. Accuracy is increased due to the contextual information encapsulated in the edges of the n-gram graphs; efficiency, on the other hand, is boosted by reducing the feature space to a limited set of dimensions that depend on the number of classes, rather than the size of the vocabulary. Our experimental study over three large-scale, real-world datasets validates the higher performance of n-gram graphs in all three domains of Web documents.
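A minimal sketch of the n-gram graph idea follows, assuming character trigrams and a small co-occurrence window; it illustrates the technique in general, not the authors' JINSECT implementation.

```python
# Character n-gram graph sketch: nodes are n-grams, weighted edges count
# co-occurrences within a sliding window. Not the authors' JINSECT code.
from collections import defaultdict

def ngram_graph(text, n=3, window=3):
    """Return {(gram_a, gram_b): weight} for n-grams co-occurring within `window`."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    edges = defaultdict(float)
    for i, g in enumerate(grams):
        for j in range(i + 1, min(i + 1 + window, len(grams))):
            edges[tuple(sorted((g, grams[j])))] += 1.0
    return edges

def value_similarity(g1, g2):
    """Shared edges scored by how close their weights are, normalized by graph size."""
    common = set(g1) & set(g2)
    if not common:
        return 0.0
    return sum(min(g1[e], g2[e]) / max(g1[e], g2[e]) for e in common) / max(len(g1), len(g2))

class_graph = ngram_graph("a cat sat on a hat")   # stands in for a merged class graph
doc_graph = ngram_graph("the cat sat on the mat")
print(value_similarity(doc_graph, class_graph))
```

The similarity used here rewards shared edges with close weights; the published model combines several such graph similarities, but this captures the core mechanism.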
International Conference on Web Engineering | 2017
Sören Auer; Simon Scerri; Aad Versteden; Erika Pauwels; Angelos Charalambidis; Stasinos Konstantopoulos; Jens Lehmann; Hajira Jabeen; Ivan Ermilov; Gezim Sejdiu; Andreas Ikonomopoulos; Spyros Andronopoulos; Mandy Vlachogiannis; Charalambos Pappas; Athanasios Davettas; Iraklis A. Klampanos; Efstathios Grigoropoulos; Vangelis Karkaletsis; Victor de Boer; Ronald Siebes; Mohamed Nadjib Mami; Sergio Albani; Michele Lazzarini; Paulo Nunes; Emanuele Angiuli; Nikiforos Pittaras; George Giannakopoulos; Giorgos Argyriou; George Stamoulis; George Papadakis
The management and analysis of large-scale datasets, described by the term Big Data, involves three classic dimensions: volume, velocity, and variety. While the former two are well supported by a plethora of software components, the variety dimension remains rather neglected. We present the BDE platform: an easy-to-deploy, easy-to-use, and adaptable (cluster-based and standalone) platform for the execution of big data components and tools such as Hadoop, Spark, Flink, Flume, and Cassandra. The BDE platform was designed based upon the requirements gathered from seven of the societal challenges put forward by the European Commission in the Horizon 2020 programme and targeted by the BigDataEurope pilots. As a result, the BDE platform supports a variety of Big Data flow tasks, such as message passing, storage, analysis, and publishing. To facilitate the processing of heterogeneous data, a particular innovation of the platform is the Semantic Layer, which allows RDF data to be processed directly and arbitrary data to be mapped and transformed into RDF. The advantages of the BDE platform are demonstrated through seven pilots, each focusing on a major societal challenge.
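As a toy illustration of the kind of data lifting such a Semantic Layer enables, the snippet below maps tabular records into RDF with rdflib; the namespace and schema are invented for the example and are not the BDE platform's actual mapping component.

```python
# Toy RDF lifting with rdflib: tabular records become triples. The
# namespace and schema are invented for this example; this is not the
# BDE Semantic Layer's actual mapping component.
from rdflib import Graph, Literal, Namespace, RDF, URIRef

EX = Namespace("http://example.org/bde/")  # hypothetical vocabulary
rows = [{"id": "s1", "name": "Sensor A", "reading": 42.0}]

g = Graph()
for row in rows:
    subj = URIRef(EX[row["id"]])
    g.add((subj, RDF.type, EX.Sensor))
    g.add((subj, EX["name"], Literal(row["name"])))
    g.add((subj, EX["reading"], Literal(row["reading"])))

print(g.serialize(format="turtle"))
```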
ACM/IEEE Joint Conference on Digital Libraries | 2011
George Papadakis; George Giannakopoulos; Claudia Niederée; Themis Palpanas; Wolfgang Nejdl
Individuals contribute content on the Web at an unprecedented rate, accumulating immense quantities of (semi-)structured data. Wisdom of the Crowds theory advocates that such information (or parts of it) is constantly overwritten, updated, or even deleted by other users, with the goal of rendering it more accurate or up-to-date. This is particularly true for the collaboratively edited, semi-structured data of entity repositories, whose entity profiles are consistently kept fresh. Therefore, the core information that remains stable over time, despite being reviewed by numerous users, is particularly useful for describing an entity. Based on this hypothesis, we introduce a classification scheme that predicts, on the basis of statistical and content patterns, whether an attribute (i.e., a name-value pair) is going to be modified in the future. We apply our scheme to a large, real-world, versioned dataset and verify its effectiveness. Our thorough experimental study also suggests that reducing entity profiles to their stable parts conveys significant benefits to two common tasks in computer science: information retrieval and information integration.
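A hedged sketch of the classification step, with an invented feature set (edit count, attribute age, value length, number of distinct past values) standing in for the paper's statistical and content patterns:

```python
# Illustrative attribute-stability classifier; the feature set and toy data
# below are assumptions for the sketch, not the paper's exact features.
from sklearn.ensemble import RandomForestClassifier

# Per attribute: [edits so far, age in days, value length, distinct past values]
X = [
    [0, 900, 12, 1],  # old, never edited -> likely stable
    [7, 30, 48, 6],   # young, frequently edited -> likely volatile
    [1, 400, 20, 2],
    [5, 60, 35, 5],
]
y = [0, 1, 0, 1]  # 1 = attribute was modified in the next version

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict([[0, 800, 15, 1]]))  # expected: [0] (stable)
```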
International Conference on Data Mining | 2010
George Giannakopoulos; Themistoklis Palpanas
In several concept attainment systems, ranging from recommendation systems to information filtering, a sliding window of learning instances has been used in the learning process to allow the learner to follow concepts that change over time. However, no analytic study has been performed on the relation between the size of the sliding window and the performance of a learning system. In this work, we present such an analytic model that describes the effect of the sliding window size on the prediction performance of a learning system based on iterative feedback. Using a signal-to-noise approach to model the learning ability of the underlying machine learning algorithms, we can provide good estimates of the average performance of a modeling system independently of the supervised machine learning algorithm employed. We experimentally validate the effectiveness of the proposed methodology with detailed experiments using synthetic and real datasets, and a variety of learning algorithms, including Support Vector Machines, Naive Bayes, Nearest Neighbor and Decision Trees. The results validate the analysis and indicate very good estimation performance in different settings.
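A small simulation in the spirit of this study, assuming a synthetic binary concept whose decision boundary flips periodically; it shows how the average accuracy of a sliding-window learner varies with the window size:

```python
# Synthetic drift simulation: a binary concept whose boundary flips every
# 200 steps; a Naive Bayes learner is retrained on the last `w` instances.
# All numbers are illustrative.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def stream(t):
    """Emit (x, y) where the labeling rule flips every 200 steps."""
    x = rng.normal(size=2)
    flip = (t // 200) % 2 == 1
    y = int((x[0] > 0) != flip)
    return x, y

for w in (20, 100, 500):
    X, Y, correct, total = [], [], 0, 0
    for t in range(2000):
        x, y = stream(t)
        if len(Y) >= 20 and len(set(Y[-w:])) == 2:  # need both classes to train
            clf = GaussianNB().fit(X[-w:], Y[-w:])
            correct += int(clf.predict([x])[0] == y)
            total += 1
        X.append(x)
        Y.append(y)
    print(f"window={w}: average accuracy ~ {correct / total:.2f}")
```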
Knowledge and Information Systems | 2013
George Giannakopoulos; Themis Palpanas
In a variety of settings, ranging from recommendation systems to information filtering, approaches that take feedback into account have been introduced to improve services and user experience. However, as also indicated in the machine learning literature, there exist several settings where the requirements and target concept of a learning system change over time, a situation known as “concept drift”. In several systems, a sliding window over the training instances has been used to follow drifting concepts. However, no general analytic study has been performed on the relation between the size of the sliding window and the average performance of a learning system, since previous works have focused on instantaneous performance and on specific underlying learners and data characteristics. This work proposes an analytic model that describes the effect of memory window size on the prediction performance of a learning system based on iterative feedback. The analysis considers target concepts changing over time, either periodically or randomly, using a formulation termed “the problem of the demanding lord”. Using a signal-to-noise approach to sketch the learning ability of the underlying machine learning algorithms, we estimate the average performance of a learning system regardless of its underlying algorithm and, as a corollary, propose a stepping stone toward finding the memory window that maximizes the average performance for a given drift setting and a specific modeling system. We experimentally support the proposed methodology with very promising results on three synthetic and four real datasets, using a variety of learning algorithms, including Support Vector Machines, Naive Bayes, Nearest Neighbor, and Decision Trees, on classification and regression tasks. The results validate the analysis and indicate very good estimation performance in different settings.
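The paper derives the optimal window analytically; the sketch below only renders the signal-to-noise intuition qualitatively, treating the fraction of current-concept instances in the window as "signal" and modeling accuracy as a sigmoid of that fraction. The exact functional form differs from the paper's, and this toy ignores the estimation-variance benefit of larger windows.

```python
# Qualitative signal-to-noise model: within a window of size w after a
# concept change at time 0, min(t, w)/w of the window's instances come from
# the current concept ("signal"); accuracy is modeled as a sigmoid of that
# fraction. The paper's exact formulation differs.
import numpy as np

def avg_accuracy(w, period=200, slope=8.0):
    """Average modeled accuracy over one drift period for window size w."""
    accs = []
    for t in range(1, period + 1):
        signal = min(t, w) / w
        accs.append(1.0 / (1.0 + np.exp(-slope * (signal - 0.5))))
    return float(np.mean(accs))

candidates = [20, 50, 100, 200, 500]
print({w: round(avg_accuracy(w), 3) for w in candidates})
print("best window:", max(candidates, key=avg_accuracy))
```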
World Wide Web | 2016
George Papadakis; George Giannakopoulos; Georgios Paliouras
Text classification constitutes a popular task in Web research with various applications that range from spam filtering to sentiment analysis. In this paper, we argue that its performance depends on the quality of Web documents, which varies significantly. For example, the curated content of news articles involves different challenges than the user-generated content of blog posts and Social Media messages. We experimentally verify our claim, quantifying the main factors that affect the performance of text classification. We also argue that the established bag-of-words representation models are inadequate for handling all document types, as they merely extract frequent, yet distinguishing terms from the textual content of the training set. Thus, they suffer from low robustness in the context of noisy or unseen content, unless they are enriched with contextual, application-specific information. In their place, we propose the use of n-gram graphs, a model that goes beyond the bag-of-words representation, transforming every document into a graph: its nodes correspond to character or word n-grams and the co-occurring ones are connected by weighted edges. Individual document graphs can be combined into class graphs and graph similarities are employed to position and classify documents in the vector space. This approach offers two advantages with respect to bag models: First, classification accuracy increases due to the contextual information that is encapsulated in the edges of the n-gram graphs. Second, it reduces the search space to a limited set of robust, endogenous features that depend on the number of classes, rather than the size of the vocabulary. Our thorough experimental study over three large, real-world corpora confirms the superior performance of n-gram graphs across the main types of Web documents.
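Building on the `ngram_graph` and `value_similarity` helpers sketched after the 2012 abstract above, the class-graph step can be approximated as follows; again, this is an illustrative sketch, not the published implementation.

```python
# Sketch of the class-graph representation model described above.
# Assumes ngram_graph() and value_similarity() from the earlier sketch.
from collections import defaultdict

def merge_graphs(graphs):
    """Average edge weights across a class's document graphs into one class graph."""
    merged = defaultdict(float)
    for g in graphs:
        for edge, w in g.items():
            merged[edge] += w / len(graphs)
    return merged

def to_features(doc_graph, class_graphs):
    """One similarity per class: a |classes|-dimensional vector, independent of vocabulary size."""
    return [value_similarity(doc_graph, cg) for cg in class_graphs]

def classify(doc_graph, class_graphs, labels):
    """Assign the label of the most similar class graph."""
    feats = to_features(doc_graph, class_graphs)
    return labels[feats.index(max(feats))]
```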
Recent Advances in Natural Language Processing | 2017
Leonidas Tsekouras; Iraklis Varlamis; George Giannakopoulos
Text comparison is an interesting though hard task, with many applications in Natural Language Processing. This work introduces a new text-similarity measure, which employs named-entity information extracted from the texts and the n-gram graph model for representing documents. Using OpenCalais as a named-entity recognition service and the JINSECT toolkit for constructing and managing n-gram graphs, we embed the text similarity measure in a text clustering algorithm (k-Means). The evaluation of the produced clusters with various clustering validity metrics shows that extracting named entities as a first step can improve the time performance of similarity measures based on the n-gram graph representation, without affecting the overall performance of the NLP task.
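A rough Python analogue of this pipeline follows. The paper uses OpenCalais and the Java JINSECT toolkit; here entity extraction is faked with a capitalized-word heuristic, and the `ngram_graph`/`value_similarity` helpers from the earlier sketch are assumed.

```python
# Entity-aware clustering sketch; assumes ngram_graph() and value_similarity()
# from the earlier n-gram graph sketch. The entities() heuristic is a crude
# stand-in for a real NER service such as OpenCalais.
import numpy as np
from sklearn.cluster import KMeans

def entities(text):
    """Toy NER: keep capitalized tokens only."""
    return " ".join(tok for tok in text.split() if tok[:1].isupper())

docs = [
    "Paris and Berlin signed a treaty",
    "Madrid hosted Berlin officials",
    "Enzymes catalyse chemical reactions",
]
graphs = [ngram_graph(entities(d)) for d in docs]
# Represent each document by its similarities to all documents,
# then cluster those vectors with k-Means.
sim = np.array([[value_similarity(a, b) for b in graphs] for a in graphs])
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(sim))
```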
European Semantic Web Conference | 2017
George Papadakis; Leonidas Tsekouras; Emmanouil Thanos; George Giannakopoulos; Themis Palpanas; Manolis Koubarakis
We present JedAI, a toolkit for Entity Resolution (ER) that can be used in three different ways: as an open-source Java library that implements numerous state-of-the-art, domain-independent methods; as a workbench that facilitates the evaluation of their relative performance; and as a desktop application that offers out-of-the-box ER solutions. JedAI bridges the gap between the database and Semantic Web communities, offering solutions that are applicable to both relational and RDF data. It also features a modular architecture that facilitates its extension with more methods and more comprehensive workflows.
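JedAI itself is a Java library, so the snippet below does not show its actual API; it only illustrates the generic blocking-then-matching workflow that such ER toolkits implement, using token blocking and a Jaccard matcher.

```python
# Generic ER workflow sketch (NOT JedAI's API): token blocking followed by
# pairwise Jaccard matching within blocks.
from collections import defaultdict
from itertools import combinations

records = {
    1: "john smith 1975 london",
    2: "j. smith london 1975",
    3: "mary jones bristol",
}

# Token blocking: records sharing any token land in the same block.
blocks = defaultdict(set)
for rid, text in records.items():
    for tok in text.split():
        blocks[tok].add(rid)

def jaccard(a, b):
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Compare only within blocks, then keep pairs above a similarity threshold.
candidates = {pair for ids in blocks.values() for pair in combinations(sorted(ids), 2)}
matches = [p for p in candidates if jaccard(records[p[0]], records[p[1]]) >= 0.5]
print(matches)  # expected: [(1, 2)]
```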
Data Mining and Knowledge Discovery | 2017
Katsiaryna Mirylenka; George Giannakopoulos; Le Minh Do; Themis Palpanas
Machine learning algorithms perform differently in settings with varying levels of training set mislabeling noise. Therefore, the choice of the right algorithm for a particular learning problem is crucial. The contribution of this paper is towards two dual problems: first, comparing algorithm behavior; and second, choosing learning algorithms for noisy settings. We present the “sigmoid rule” framework, which can be used to choose the most appropriate learning algorithm depending on the properties of noise in a classification problem. The framework uses an existing model of the expected performance of learning algorithms as a sigmoid function of the signal-to-noise ratio in the training instances. We study the characteristics of the sigmoid function using five representative non-sequential classifiers, namely Naïve Bayes, kNN, SVM, a decision tree classifier, and a rule-based classifier, as well as three widely used sequential classifiers based on hidden Markov models, conditional random fields, and recursive neural networks. Based on the sigmoid parameters, we define a set of intuitive criteria that are useful for comparing the behavior of learning algorithms in the presence of noise. Furthermore, we show that there is a connection between these parameters and the characteristics of the underlying dataset, which means we can estimate the expected performance over a dataset regardless of the underlying algorithm. The framework is applicable to concept drift scenarios, including the modeling of user behavior over time and the mining of noisy, evolving time series.
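A hedged sketch of fitting the sigmoid rule for one algorithm: given observed (signal-to-noise ratio, accuracy) pairs, fit a four-parameter sigmoid whose asymptotes, slope, and midpoint can then be compared across algorithms. The data points below are invented for the demonstration.

```python
# Fit a four-parameter sigmoid of accuracy vs. signal-to-noise ratio for
# one algorithm; the (snr, accuracy) points are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(snr, lo, hi, k, m):
    """Accuracy rising from `lo` to `hi`, with slope `k` around midpoint `m`."""
    return lo + (hi - lo) / (1.0 + np.exp(-k * (snr - m)))

snr = np.array([0.1, 0.5, 1.0, 2.0, 4.0, 8.0])
acc = np.array([0.52, 0.58, 0.70, 0.85, 0.92, 0.94])  # made-up observations
params, _ = curve_fit(sigmoid, snr, acc, p0=[0.5, 1.0, 1.0, 1.0], maxfev=10000)
print(dict(zip(["lo", "hi", "k", "m"], np.round(params, 3))))
```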
International Conference on Algorithms for Computational Biology | 2014
Dimitris Polychronopoulos; Anastasia Krithara; Christoforos Nikolaou; Georgios Paliouras; Yannis Almirantis; George Giannakopoulos
Most common methods for analyzing genomic sequence composition are based on the bag-of-words approach and thus largely ignore the original sequence structure or the relative positioning of its constituent oligonucleotides. Here we present a novel methodology that takes into account both word representation and relative positioning at various length scales, in the form of n-gram graphs (NGG). We applied the NGG approach to short vertebrate and invertebrate constrained genomic sequences of various origins and predicted functionalities, and were able to efficiently distinguish among DNA sequences belonging to the same species (intra-species classification). As an alternative method, we also applied the Genomic Signatures (GS) approach to the same sequences. To our knowledge, this is the first time that GS have been applied to short sequences, rather than whole genomes. Together, the presented results suggest that NGG is an efficient method for classifying sequences originating from a given genome according to their function.
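For the Genomic Signatures baseline, a compact sketch is a k-mer frequency vector per sequence compared with an L1 distance; the sequences and parameters below are toy values, and the NGG method would instead build n-gram graphs over the sequences, as sketched earlier.

```python
# Genomic Signature sketch: normalized k-mer frequency vectors compared
# with an L1 distance. Toy DNA strings for illustration only.
from collections import Counter

def kmer_signature(seq, k=3):
    """Normalized k-mer frequencies of a DNA string."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def l1_distance(s1, s2):
    """Sum of absolute frequency differences over the union of k-mers."""
    return sum(abs(s1.get(km, 0.0) - s2.get(km, 0.0)) for km in set(s1) | set(s2))

a = kmer_signature("ATGCGATACGCTTATGCGA")
b = kmer_signature("ATGCGATACGGTTATGCGT")
print(round(l1_distance(a, b), 3))
```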