Ekaterini Ioannou | Researchain

Publication

Featured research published by Ekaterini Ioannou.

web search and data mining | 2011

Efficient entity resolution for large heterogeneous information spaces

George Papadakis; Ekaterini Ioannou; Claudia Niederée; Peter Fankhauser

We have recently witnessed an enormous growth in the volume of structured and semi-structured data sets available on the Web. An important prerequisite for using and combining such data sets is the detection and merging of information that describes the same real-world entities, a task known as Entity Resolution. To make this quadratic task efficient, blocking techniques are typically employed. However, the high dynamics, loose schema binding, and heterogeneity of (semi-)structured data impose new challenges on entity resolution. Existing blocking approaches become inapplicable because they rely on the homogeneity of the considered data and on a priori known schemata. In this paper, we introduce a novel approach for entity resolution, scaling it up for large, noisy, and heterogeneous information spaces. It combines an attribute-agnostic mechanism for building blocks with intelligent block processing techniques that boost blocks with high expected utility, propagate knowledge about identified matches, and preempt the resolution process when it gets too expensive. Our extensive evaluation on real-world, large, heterogeneous data sets verifies that the suggested approach is both effective and efficient.

web search and data mining | 2012

Beyond 100 million entities: large-scale blocking-based resolution for heterogeneous data

George Papadakis; Ekaterini Ioannou; Claudia Niederée; Themis Palpanas; Wolfgang Nejdl

A prerequisite for leveraging the vast amount of data available on the Web is Entity Resolution, i.e., the process of identifying and linking data that describe the same real-world objects. To make this inherently quadratic process applicable to large data sets, blocking is typically employed: entities (records) are grouped into clusters - the blocks - of matching candidates and only entities of the same block are compared. However, novel blocking techniques are required for dealing with the noisy, heterogeneous, semi-structured, user-generated data on the Web, as traditional blocking techniques are inapplicable due to their reliance on schema information. The introduction of redundancy improves the robustness of blocking methods but comes at the price of additional computational cost. In this paper, we present methods for enhancing the efficiency of redundancy-bearing blocking methods, such as our attribute-agnostic blocking approach. We introduce novel blocking schemes that build blocks based on a variety of evidence, including entity identifiers and relationships between entities; they significantly reduce the required number of comparisons, while maintaining blocking effectiveness at very high levels. We also introduce two theoretical measures that provide a reliable estimation of the performance of a blocking method, without requiring the analytical processing of its blocks. Based on these measures, we develop two techniques for improving the performance of blocking: combining individual, complementary blocking schemes, and purging blocks until given criteria are satisfied. We test our methods through an extensive experimental evaluation, using a voluminous data set with 182 million heterogeneous entities. The outcomes of our study show the applicability and the high performance of our approach.
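
The blocking idea described in this abstract can be illustrated with a minimal sketch. It assumes a simple token-based scheme in which every token of every attribute value becomes a block key, so attribute names (the schema) are never consulted. The function names `token_blocking` and `candidate_pairs` and the sample records are hypothetical, chosen for illustration; this is not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(entities):
    """Build one block per token found in any attribute value (names ignored)."""
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in str(value).lower().split():
                blocks[token].add(entity_id)
    # A block holding a single entity suggests no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def candidate_pairs(blocks):
    """Yield the comparisons suggested by each block, repeats included."""
    for ids in blocks.values():
        yield from combinations(sorted(ids), 2)


# Hypothetical records with different schemata (illustrative data only).
entities = {
    "e1": {"name": "Ekaterini Ioannou", "venue": "WSDM"},
    "e2": {"fullName": "E. Ioannou", "conf": "WSDM 2011"},
    "e3": {"title": "Desktop search"},
}
blocks = token_blocking(entities)
pairs = list(candidate_pairs(blocks))
```

In this toy input, `e1` and `e2` share two blocks, so the same pair is suggested twice. That repetition is the redundancy the abstract refers to: it makes blocking more robust to noise, but it adds comparisons.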

extending database technology | 2011

Efficient discovery of frequent subgraph patterns in uncertain graph databases

Odysseas Papapetrou; Ekaterini Ioannou; Dimitrios Skoutas

Mining frequent subgraph patterns in graph databases is a challenging and important problem with applications in several domains. Recently, there has been growing interest in generalizing the problem to uncertain graphs, which can model the inherent uncertainty in the data of many applications. The main difficulty in solving this problem results from the large number of candidate subgraph patterns to be examined and the large number of subgraph isomorphism tests required to find the graphs that contain a given pattern. The latter becomes even more challenging when dealing with uncertain graphs. In this paper, we propose a method that uses an index of the uncertain graph database to reduce the number of comparisons needed to find frequent subgraph patterns. The proposed algorithm relies on the apriori property for enumerating candidate subgraph patterns efficiently. Then, the index is used to reduce the number of comparisons required for computing the expected support of each candidate pattern. It also enables additional optimizations with respect to scheduling and early termination that further increase the efficiency of the method. The evaluation of our approach on three real-world datasets as well as on synthetic uncertain graph databases demonstrates the significant cost savings with respect to the state-of-the-art approach.

acm ieee joint conference on digital libraries | 2011

Eliminating the redundancy in blocking-based entity resolution methods

George Papadakis; Ekaterini Ioannou; Claudia Niederée; Themis Palpanas; Wolfgang Nejdl

Entity resolution is the task of identifying entities that refer to the same real-world object. It has important applications in the context of digital libraries, such as citation matching and author disambiguation. Blocking is an established methodology for efficiently addressing this problem; it clusters similar entities together, and compares only entities inside the same cluster. In order to effectively deal with the current large, noisy and heterogeneous data collections, novel blocking methods that rely on redundancy have been introduced: they associate each entity with multiple blocks in order to increase recall, thus increasing the computational cost as well. In this paper, we introduce novel techniques that remove the superfluous comparisons from any redundancy-based blocking method. They improve the time-efficiency of the latter without any impact on the end result. We present the optimal solution to this problem that discards all redundant comparisons at the cost of quadratic space complexity. For applications with space limitations, we also present an alternative, lightweight solution that operates at the abstract level of blocks in order to discard a significant part of the redundant comparisons. We evaluate our techniques on two large, real-world data sets and verify the significant improvements they convey when integrated into existing blocking methods.
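
One simple way to picture the trade-off between discarding every redundant comparison and paying for it in space is to remember each pair that has already been compared. The sketch below is an illustrative reading of that trade-off, not the algorithm from the paper; the function name `distinct_comparisons` and the sample blocks are hypothetical.

```python
from itertools import combinations


def distinct_comparisons(blocks):
    """Yield each entity pair once, however many blocks the pair shares."""
    seen = set()  # may grow quadratically in the number of entities
    for ids in blocks:
        for pair in combinations(sorted(ids), 2):
            if pair not in seen:
                seen.add(pair)
                yield pair


# Three overlapping blocks (illustrative data only).
blocks = [{"a", "b", "c"}, {"b", "c"}, {"a", "c", "d"}]
pairs = list(distinct_comparisons(blocks))
```

Processing the three blocks naively would execute seven comparisons; with the `seen` set only the five distinct pairs remain, and the set of resulting matches is unaffected because no distinct pair is dropped.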

conference on advanced information systems engineering | 2008

Probabilistic Entity Linkage for Heterogeneous Information Spaces

Ekaterini Ioannou; Claudia Niederée; Wolfgang Nejdl

Heterogeneous information spaces are typically created by merging data from a variety of different applications and information sources. These sources often use different identifiers for data that describe the same real-world entity (for example an artist, a conference, an organization). In this paper we propose a new probabilistic Entity Linkage algorithm for identifying and linking data that refer to the same real-world entity. Our approach focuses on managing entity linkage information in heterogeneous information spaces using probabilistic methods. We use a Bayesian network to model the evidence supporting the possible object matches, along with the interdependencies between them. This enables us to flexibly update the network when new information becomes available, and to cope with the different requirements imposed by applications built on top of information spaces.

Journal of Web Semantics | 2010

Leveraging personal metadata for Desktop search: The Beagle++ system

Enrico Minack; Raluca Paiu; Stefania Costache; Gianluca Demartini; Julien Gaugaz; Ekaterini Ioannou; Paul-Alexandru Chirita; Wolfgang Nejdl

Search on PCs has become less efficient than searching the Web due to the increasing amount of stored data. In this paper we present an innovative Desktop search solution, which relies on extracted metadata, context information as well as additional background information for improving Desktop search results. We also present a practical application of this approach: the extensible Beagle++ toolbox. To prove the validity of our approach, we conducted a series of experiments. By comparing our results against those of a regular Desktop search solution, Beagle, we show an improved quality in search and overall performance.

international semantic web conference | 2010

Efficient semantic-aware detection of near duplicate resources

Ekaterini Ioannou; Odysseas Papapetrou; Dimitrios Skoutas; Wolfgang Nejdl

Efficiently detecting near duplicate resources is an important task when integrating information from various sources and applications. Once detected, near duplicate resources can be grouped together, merged, or removed, in order to avoid repetition and redundancy, and to increase the diversity in the information provided to the user. In this paper, we introduce an approach for efficient semantic-aware near duplicate detection, by combining an indexing scheme for similarity search with the RDF representations of the resources. We provide a probabilistic analysis for the correctness of the suggested approach, which allows applications to configure it for satisfying their specific quality requirements. Our experimental evaluation on the RDF descriptions of real-world news articles from various news agencies demonstrates the efficiency and effectiveness of our approach.

conference on advanced information systems engineering | 2010

From web data to entities and back

Zoltán Miklós; Nicolas Bonvin; Paolo Bouquet; Michele Catasta; Daniele Cordioli; Peter Fankhauser; Julien Gaugaz; Ekaterini Ioannou; Hristo Koshutanski; Antonio Maña; Claudia Niederée; Themis Palpanas; Heiko Stoermer

We present the Entity Name System (ENS), an enabling infrastructure that can host descriptions of named entities and provide unique identifiers at large scale. In this way, it opens new perspectives for realizing entity-oriented, rather than keyword-oriented, Web information systems. We describe the architecture and the functionality of the ENS, along with tools that together contribute to realizing the Web of entities.

Proceedings of the International Workshop on Semantic Web Information Management | 2011

To compare or not to compare: making entity resolution more efficient

George Papadakis; Ekaterini Ioannou; Claudia Niederée; Themis Palpanas; Wolfgang Nejdl

Blocking methods are crucial for making the inherently quadratic task of Entity Resolution more efficient. The blocking methods proposed in the literature rely on the homogeneity of data and the availability of binding schema information; thus, they are inapplicable to the voluminous, noisy, and highly heterogeneous data of the Web 2.0 user-generated content. To deal with such data, attribute-agnostic blocking has recently been introduced, following a two-fold strategy: the first layer places entities into overlapping blocks in order to achieve high effectiveness, while the second layer reduces the number of unnecessary comparisons in order to enhance efficiency. In this paper, we present a set of techniques that can be plugged into the second strategy layer of attribute-agnostic blocking to further improve its efficiency. We introduce a technique that eliminates redundant comparisons, and, based on this, we incorporate an approximate method for pruning comparisons that are highly likely to involve non-matching entities. We also introduce a novel measure for quantifying the redundancy a blocking method entails and explain how it can be used to tune the process of comparison pruning a priori. We apply our blocking techniques on two large, real-world data sets and report results that demonstrate a substantial increase in efficiency at a negligible (if any) cost in effectiveness.

international conference on web engineering | 2010

Efficient term cloud generation for streaming web content

Odysseas Papapetrou; George Papadakis; Ekaterini Ioannou; Dimitrios Skoutas

Large amounts of information are posted daily on the Web, such as articles published online by traditional news agencies or blog posts referring to and commenting on various events. Although users sometimes rely on a small set of trusted sources from which to get their information, they often also want to get a wider overview and glimpse of what is being reported and discussed in the news and the blogosphere. In this paper, we present an approach for supporting this discovery and exploration process by exploiting term clouds. In particular, we provide an efficient method for dynamically computing the most frequently appearing terms in the posts of monitored online sources, for time intervals specified at query time, without the need to archive the actual published content. An experimental evaluation on a large-scale real-world set of blogs demonstrates the accuracy and the efficiency of the proposed method in terms of computational time and memory requirements.
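
To make the query-time interval idea concrete, here is a minimal sketch that keeps only per-bucket term counts and discards the post text. It assumes exact counters over fixed-width time buckets, which is a simplification chosen for illustration and may differ from the data structure used in the paper. The class `TermCloudIndex`, its methods, and the one-hour bucket width are all hypothetical.

```python
from collections import Counter, defaultdict


class TermCloudIndex:
    """Per-bucket term counts; the text of each post is never stored."""

    def __init__(self, bucket_seconds=3600):
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(Counter)

    def add_post(self, timestamp, text):
        # Count the terms in the bucket covering this timestamp, then drop the text.
        bucket = timestamp // self.bucket_seconds
        self.buckets[bucket].update(text.lower().split())

    def top_terms(self, start, end, k=3):
        # Merge the counters of every bucket overlapping [start, end].
        total = Counter()
        first = start // self.bucket_seconds
        last = end // self.bucket_seconds
        for bucket in range(first, last + 1):
            if bucket in self.buckets:
                total.update(self.buckets[bucket])
        return total.most_common(k)


index = TermCloudIndex()
index.add_post(0, "entity resolution blocking")
index.add_post(3700, "blocking blocks entity")
index.add_post(8000, "graph mining")
top = index.top_terms(0, 7199, k=2)
```

Intervals are resolved at bucket granularity, so a query boundary that falls inside a bucket includes that whole bucket. Smaller buckets give finer intervals at the price of more counters to keep and merge.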