Vasilis Efthymiou
University of Crete
Publication
Featured research published by Vasilis Efthymiou.
international world wide web conferences | 2014
Kostas Stefanidis; Vasilis Efthymiou; Melanie Herschel; Vassilis Christophides
This tutorial provides an overview of the key research results in entity resolution that are relevant to the new challenges posed by the Web of data, in which real-world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping and sometimes evolving, entity resolution emerges as a central problem both for increasing dataset linking and for searching the Web of data for entities and their relations.
international conference on big data | 2015
Vasilis Efthymiou; George Papadakis; George Papastefanatos; Kostas Stefanidis; Themis Palpanas
Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. Typically, it scales to large volumes of data through blocking: similar entities are clustered into blocks so that it suffices to perform comparisons only within each block. Meta-blocking further increases efficiency by cleaning the overlapping blocks from unnecessary comparisons. However, even Meta-blocking can be time-consuming: applying it to blocks with 7.4 million entities and 2.2 × 10^11 comparisons takes almost 8 days on a modern high-end server. In this paper, we parallelize Meta-blocking based on MapReduce. We propose a simple strategy that explicitly creates the core concept of Meta-blocking, the blocking graph. We then describe an advanced strategy that creates the blocking graph implicitly, reducing the overhead of data exchange. We also introduce a load balancing algorithm that distributes the computationally intensive workload evenly among the available compute nodes. Our experimental analysis verifies the superiority of our advanced strategy and demonstrates an almost linear speedup for all meta-blocking techniques with respect to the number of available nodes.
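The blocking graph mentioned in this abstract can be illustrated with a small single-machine sketch. It is not the paper's MapReduce implementation: the weighting (number of shared blocks) and the pruning rule (keep edges at or above the mean weight) are assumptions chosen for brevity, and the function names and toy blocks are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_blocking_graph(blocks):
    """Return edge weights: how many blocks each pair of entities shares.

    Assumed weighting scheme for illustration; other schemes are possible.
    """
    weights = defaultdict(int)
    for entities in blocks.values():
        # Every pair co-occurring in a block is one suggested comparison.
        for a, b in combinations(sorted(set(entities)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_edges(weights):
    """Keep only the edges whose weight is at least the mean edge weight."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Hypothetical overlapping blocks: block key -> entity ids.
blocks = {
    "smith": ["e1", "e2", "e3"],
    "john": ["e1", "e2"],
    "athens": ["e2", "e3", "e4"],
    "1980": ["e1", "e2"],
}

graph = build_blocking_graph(blocks)
kept = prune_edges(graph)
print(len(graph), "distinct comparisons before pruning,", len(kept), "after")
```

In this toy input the blocks suggest eight comparisons, three of them repeated; the graph collapses them into five distinct edges and pruning retains the two pairs that share the most blocks.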
Information Systems | 2017
Vasilis Efthymiou; George Papadakis; George Papastefanatos; Kostas Stefanidis; Themis Palpanas
Entity resolution constitutes a crucial task for many applications, but has an inherently quadratic complexity. To enable entity resolution to scale to large volumes of data, blocking is typically employed: it clusters similar entities into (overlapping) blocks so that it suffices to perform comparisons only within each block. To further increase efficiency, Meta-blocking is used to clean the overlapping blocks from unnecessary comparisons, increasing precision by orders of magnitude at a small cost in recall. Despite its high time efficiency, though, using Meta-blocking in practice to solve the entity resolution problem on very large datasets is still challenging: applying it to 7.4 million entities takes almost 8 full days on a modern high-end server. In this paper, we introduce scalable algorithms for Meta-blocking, exploiting the MapReduce framework. Specifically, we describe a strategy for parallel execution that explicitly targets the core concept of Meta-blocking, the blocking graph. Furthermore, we propose two more advanced strategies, aiming to reduce the overhead of data exchange. The comparison-based strategy creates the blocking graph implicitly, while the entity-based strategy is independent of the blocking graph, employing fewer MapReduce jobs with more elaborate processing. We also introduce a load balancing algorithm that distributes the computationally intensive workload evenly among the available compute nodes. Our experimental analysis verifies the feasibility and superiority of our advanced strategies, and demonstrates their scalability to very large datasets.
Highlights:
We adapt Meta-blocking to the MapReduce paradigm through 3 alternative parallelization strategies: an edge-based strategy that explicitly builds the blocking graph, a comparison-based strategy that uses the blocking graph implicitly, as a conceptual model, and an entity-based strategy that is independent of the blocking graph.
We also provide concrete implementations for all weighting schemes that are used in Meta-blocking.
We present a load balancing technique that deals with skewness in the input block collection, splitting it into partitions of the same computational cost.
We verify the scalability of our techniques through a thorough experimental evaluation over the four largest real datasets to which Meta-blocking has been applied.
The data and the implementation of our techniques are publicly available.
international conference on big data | 2015
Vasilis Efthymiou; Kostas Stefanidis; Vassilis Christophides
In the Web of data, entities are described by interlinked data rather than documents on the Web. In this work, we focus on entity resolution in the Web of data, i.e., identifying descriptions that refer to the same real-world entity. To reduce the required number of pairwise comparisons, methods for entity resolution perform blocking as a pre-processing step. A blocking technique places similar entity descriptions into blocks and executes comparisons only between descriptions within the same block. We experimentally evaluate blocking techniques proposed for the Web of data and present dataset characteristics that determine the effectiveness and efficiency of such methods. Furthermore, we analyze the characteristics of the missed matching entity descriptions and examine different types of links that blocking techniques can potentially identify.
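As a rough illustration of the blocking step described above, the following sketch places two descriptions in the same block whenever they share a token in any attribute value. This schema-agnostic token blocking is only one possible technique and is an assumption here, not necessarily one of the methods the paper evaluates; the identifiers and sample descriptions are hypothetical.

```python
import re
from collections import defaultdict


def token_blocking(descriptions):
    """Group entity ids by the tokens that appear in any of their values."""
    blocks = defaultdict(set)
    for entity_id, attributes in descriptions.items():
        for value in attributes.values():
            for token in re.findall(r"\w+", value.lower()):
                blocks[token].add(entity_id)
    # A block with a single description yields no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Collect the distinct pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                pairs.add((a, b))
    return pairs


# Hypothetical descriptions from two knowledge bases with different schemas.
descriptions = {
    "kb1:Heraklion": {"label": "Heraklion", "country": "Greece"},
    "kb2:Iraklio": {"name": "Heraklion Crete", "locatedIn": "Greece"},
    "kb2:Paris": {"name": "Paris", "locatedIn": "France"},
}

blocks = token_blocking(descriptions)
pairs = candidate_pairs(blocks)
print(pairs)
```

With three descriptions an exhaustive approach would compare three pairs; here only the pair sharing the tokens "heraklion" and "greece" remains a candidate.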
IEEE Transactions on Big Data | 2016
Vasilis Efthymiou; Kostas Stefanidis; Vassilis Christophides
An increasing number of entities are described by interlinked data rather than documents on the Web. Entity Resolution (ER) aims to identify descriptions of the same real-world entity within one or across knowledge bases in the Web of data. To reduce the required number of pairwise comparisons among descriptions, ER methods typically perform a pre-processing step, called blocking, which places similar entity descriptions into blocks so that only descriptions within the same block are compared. We experimentally evaluate several blocking methods proposed for the Web of data using real datasets, whose characteristics significantly impact their effectiveness and efficiency. The proposed experimental evaluation framework allows us to better understand the characteristics of the missed matching entity descriptions and contrast them with ground truth obtained from different kinds of relatedness links.
international conference on big data | 2015
Vasilis Efthymiou; Kostas Stefanidis; Eirini Ntoutsi
Top-k is a well-studied problem in the literature, due to its wide spectrum of applications, such as information retrieval, database querying, Web search and data mining. In the big data era, the volume and velocity of data call for efficient parallel solutions that overcome the restricted resources of a single machine. Our motivating application is recommenders, which typically deal with large numbers of users and items, but other applications, such as keyword search, might benefit as well. In this paper, we propose a parallel top-k MapReduce algorithm that, unlike existing MapReduce solutions, manages to handle cases in which the k results do not fit in memory.
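For context, the common baseline pattern for parallel top-k can be sketched as below: each map task keeps its local top-k and a single reduce step merges them. This is not the algorithm proposed in the paper; the baseline assumes that k results fit in memory, which is exactly the restriction the paper aims to lift. Function names and the sample scores are hypothetical.

```python
import heapq


def map_local_top_k(partition, k):
    """Map step: keep only the k highest-scored (item, score) pairs."""
    return heapq.nlargest(k, partition, key=lambda pair: pair[1])


def reduce_global_top_k(local_results, k):
    """Reduce step: merge the local lists and keep the global top k."""
    merged = (pair for local in local_results for pair in local)
    return heapq.nlargest(k, merged, key=lambda pair: pair[1])


# Hypothetical input, already split into three partitions.
partitions = [
    [("item_a", 0.91), ("item_b", 0.40), ("item_c", 0.77)],
    [("item_d", 0.85), ("item_e", 0.12)],
    [("item_f", 0.95), ("item_g", 0.60), ("item_h", 0.30)],
]
k = 2

local = [map_local_top_k(p, k) for p in partitions]
top_k = reduce_global_top_k(local, k)
print(top_k)
```

The local step is safe because any item in the global top k must also be in the top k of its own partition, so at most k pairs per partition need to be shipped to the reducer.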
international semantic web conference | 2017
Vasilis Efthymiou; Oktie Hassanzadeh; Mariano Rodriguez-Muro; Vassilis Christophides
Web tables constitute valuable sources of information for various applications, ranging from Web search to Knowledge Base (KB) augmentation. An underlying common requirement is to annotate the rows of Web tables with semantically rich descriptions of entities published in Web KBs. In this paper, we evaluate three unsupervised annotation methods: (a) a lookup-based method which relies on the minimal entity context provided in Web tables to discover correspondences to the KB, (b) a semantic embeddings method that exploits a vectorial representation of the rich entity context in a KB to identify the most relevant subset of entities in the Web table, and (c) an ontology matching method, which exploits schematic and instance information of entities available both in a KB and a Web table. Our experimental evaluation is conducted using two existing benchmark data sets in addition to a new large-scale benchmark created using Wikipedia tables. Our results show that: (1) our novel lookup-based method outperforms state-of-the-art lookup-based methods, (2) the semantic embeddings method outperforms lookup-based methods in one benchmark data set, and (3) the lack of a rich schema in Web tables can limit the ability of ontology matching tools in performing high-quality table annotation. As a result, we propose a hybrid method that significantly outperforms individual methods on all the benchmarks.
extending database technology | 2016
Vasilis Efthymiou; Kostas Stefanidis; Vassilis Christophides
Entity resolution aims to identify descriptions of the same entity within or across knowledge bases. In this work, we present the Minoan ER platform for resolving entities described by linked data in the Web (e.g., in RDF). To reduce the required number of comparisons, Minoan ER performs blocking to place similar descriptions into blocks and executes comparisons to identify matches only between descriptions within the same block. Moreover, it explores in a pay-as-you-go fashion any intermediate results of matching to obtain similarity evidence of entity neighbors and discover new candidate description pairs for resolution.
international conference on data engineering | 2017
Kostas Stefanidis; Vassilis Christophides; Vasilis Efthymiou
Entity resolution aims to identify descriptions of the same entity within or across knowledge bases. In this work, we provide a comprehensive and cohesive overview of the key research results in the area of entity resolution. We are interested in frameworks addressing the new challenges in entity resolution posed by the Web of data, in which real-world entities are described by interlinked data rather than documents. Since such descriptions are usually partial, overlapping and sometimes evolving, entity resolution emerges as a central problem both for increasing dataset linking and for searching the Web of data for entities and their relations. We focus on Web-scale blocking, iterative and progressive solutions for entity resolution. Specifically, to reduce the required number of comparisons, blocking places similar descriptions into blocks, and comparisons to identify matches are executed only between descriptions within the same block. To minimize the number of missed matches, an iterative entity resolution process can exploit any intermediate results of blocking and matching, discovering new candidate description pairs for resolution. Finally, we overview works on progressive entity resolution, which attempt to discover as many matches as possible given a limited computing budget, by estimating the matching likelihood of yet unresolved descriptions, based on the matches found so far.
Archive | 2017
Vasilis Efthymiou; Petros Zervoudakis; Kostas Stefanidis; Dimitris Plexousakis
Recommender systems have received significant attention, with most of the proposed methods focusing on recommendations for single users. However, there are contexts in which the items to be suggested are intended not for a single user but for a group of people. For example, consider a group of friends or a family planning to watch a movie or visit a restaurant. In this paper, we propose an extensive model for group recommendations that exploits recommendations for items that users similar to the group members liked in the past. We follow two different approaches for offering recommendations to the members of a group: either considering the members of a group as a single user and recommending to this user items that similar users liked, or first estimating how much each group member would like an item and then recommending the items that would satisfy the most (or dissatisfy the fewest) members of the group. For each of the two approaches, we introduce a different MapReduce algorithm, and evaluate the results on real data from the movie industry.
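The second approach, estimating a score per member and then combining the scores, might look like the following single-machine sketch. The per-member predictions are taken as given, and the two aggregation rules (average and least misery, i.e. the minimum) are common choices assumed here for illustration rather than the paper's exact definitions; all names and ratings are hypothetical.

```python
def aggregate(predictions, strategy):
    """Combine per-member predicted scores into one score per item."""
    combine = {
        "average": lambda scores: sum(scores) / len(scores),
        "least_misery": min,  # the group is as happy as its unhappiest member
    }[strategy]
    # Only items with a prediction for every member can be aggregated.
    items = set.intersection(*(set(scores) for scores in predictions.values()))
    return {
        item: combine([scores[item] for scores in predictions.values()])
        for item in items
    }


def recommend(predictions, strategy, n):
    """Return the n items with the highest group score (ties by item name)."""
    group_scores = aggregate(predictions, strategy)
    return sorted(group_scores, key=lambda item: (-group_scores[item], item))[:n]


# Hypothetical predicted ratings for each group member.
predictions = {
    "alice": {"movie_x": 5.0, "movie_y": 3.0, "movie_z": 4.0},
    "bob": {"movie_x": 1.0, "movie_y": 3.5, "movie_z": 4.0},
    "carol": {"movie_x": 5.0, "movie_y": 3.0, "movie_z": 3.5},
}

print(recommend(predictions, "average", 2))
print(recommend(predictions, "least_misery", 2))
```

The two rules disagree on this input: averaging still ranks movie_x second despite one member strongly disliking it, whereas least misery replaces it with movie_y.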