Georgia Koutrika | Researchain

Publication

Featured research published by Georgia Koutrika.

IEEE Transactions on Knowledge and Data Engineering | 2014

Meta-Blocking: Taking Entity Resolution to the Next Level

George Papadakis; Georgia Koutrika; Themis Palpanas; Wolfgang Nejdl

Entity Resolution is an inherently quadratic task that typically scales to large data collections through blocking. In the context of highly heterogeneous information spaces, blocking methods rely on redundancy in order to ensure high effectiveness at the cost of lower efficiency (i.e., more comparisons). This effect is partially ameliorated by coarse-grained block processing techniques that discard entire blocks either a priori or during the resolution process. In this paper, we introduce meta-blocking as a generic procedure that intervenes between the creation and the processing of blocks, transforming an initial set of blocks into a new one with substantially fewer comparisons and equally high effectiveness. In essence, meta-blocking aims at extracting the most similar pairs of entities by leveraging the information that is encapsulated in the block-to-entity relationships. To this end, it first builds an abstract graph representation of the original set of blocks, with the nodes corresponding to entity profiles and the edges connecting the co-occurring ones. During the creation of this structure, all redundant comparisons are discarded, while the superfluous ones can be removed by pruning the edges with the lowest weight. We analytically examine both procedures, proposing a multitude of edge weighting schemes, graph pruning algorithms, and pruning criteria. Our approaches are schema-agnostic, thus accommodating any type of blocks. We evaluate their performance through a thorough experimental study over three large-scale, real-world data sets, with the outcomes verifying significant efficiency enhancements at a negligible cost in effectiveness.
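The graph construction and pruning described in this abstract can be illustrated with a minimal sketch. The abstract does not spell out the weighting schemes or pruning criteria, so the two choices below are assumptions made only for illustration: an edge is weighted by the number of blocks two entities share, and edges below the mean weight are pruned. The function names and the toy block collection are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_blocking_graph(blocks):
    """Map each co-occurring entity pair to a weight.

    Assumed weighting: the number of blocks the two entities share.
    Each pair becomes a single edge, so comparisons repeated across
    blocks (the redundant ones) collapse into one.
    """
    weights = defaultdict(int)
    for members in blocks.values():
        for a, b in combinations(sorted(set(members)), 2):
            weights[(a, b)] += 1
    return dict(weights)


def prune_below_mean(weights):
    """Assumed pruning criterion: keep edges whose weight is at least the mean."""
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {edge: w for edge, w in weights.items() if w >= mean}


# Hypothetical block collection: block key -> entity profile ids.
blocks = {
    "smith": ["e1", "e2", "e3"],
    "john": ["e1", "e2"],
    "paris": ["e2", "e3", "e4"],
    "1980": ["e1", "e2"],
}

graph = build_blocking_graph(blocks)  # 5 distinct edges from 8 in-block comparisons
kept = prune_below_mean(graph)        # only the pairs that share the most blocks
print(kept)
```

In this toy input the blocks imply 8 pairwise comparisons, the graph keeps 5 distinct pairs, and pruning leaves the 2 pairs with the strongest co-occurrence evidence.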

extending database technology | 2013

HIL: a high-level scripting language for entity integration

Mauricio A. Hernández; Georgia Koutrika; Rajasekar Krishnamurthy; Lucian Popa; Ryan Wisnesky

We introduce HIL, a high-level scripting language for entity resolution and integration. HIL aims at providing the core logic for complex data processing flows that aggregate facts from large collections of structured or unstructured data into clean, unified entities. Such flows typically include many stages of processing that start from the outcome of information extraction and continue with entity resolution, mapping and fusion. A HIL program captures the overall integration flow through a combination of SQL-like rules that link, map, fuse and aggregate entities. A salient feature of HIL is the use of logical indexes in its data model to facilitate the modular construction and aggregation of complex entities. Another feature is the presence of a flexible, open type system that allows HIL to handle input data that is irregular, sparse or partially known. As a result, HIL can accurately express complex integration tasks, while still being high-level and focused on the logical entities (rather than the physical operations). Compilation algorithms translate the HIL specification into efficient run-time queries that can execute in parallel on Hadoop. We show how our framework is applied to real-world integration of entities in the financial domain, based on public filings archived by the U.S. Securities and Exchange Commission (SEC). Furthermore, we apply HIL on a larger-scale scenario that performs fusion of data from hundreds of millions of Twitter messages into tens of millions of structured entities.

Search Computing | 2012

A survey on proximity measures for social networks

Sara Cohen; Benny Kimelfeld; Georgia Koutrika

Measuring proximity in a social network is an important task, with many interesting applications, including person search and link prediction. Person search is the problem of finding, by means of keyword search, relevant people in a social network. In user-centric person search, the search query is issued by a person participating in the social network and the goal is to find people that are relevant not only to the keywords, but also to the searcher herself. Link prediction is the task of predicting new friendships (links) that are likely to be added to the network. Both of these tasks require the ability to measure proximity of nodes within a network, and are becoming increasingly important as social networks become more ubiquitous. This chapter surveys recent work on scoring measures for determining proximity between nodes in a social network. We broadly identify various classes of measures and discuss prominent examples within each class. We also survey efficient implementations for computing or estimating the values of the proximity measures.
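To make the idea of a proximity score concrete, here is a small sketch of three well-known neighborhood-based measures (common neighbors, Jaccard coefficient, Adamic-Adar). The abstract does not list which measures the chapter covers, so treating these as representative is an assumption; the toy friendship graph and function names are hypothetical.

```python
import math


def common_neighbors(adj, u, v):
    """Number of friends that u and v share."""
    return len(adj[u] & adj[v])


def jaccard(adj, u, v):
    """Shared friends divided by the total distinct friends of u and v."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0


def adamic_adar(adj, u, v):
    """Shared friends, each discounted by the log of its own degree."""
    return sum(
        1.0 / math.log(len(adj[z]))
        for z in adj[u] & adj[v]
        if len(adj[z]) > 1  # guard against log(1) == 0
    )


# Hypothetical undirected friendship graph as adjacency sets.
adj = {
    "ann": {"bob", "cat", "dan"},
    "bob": {"ann", "cat"},
    "cat": {"ann", "bob", "dan"},
    "dan": {"ann", "cat", "eve"},
    "eve": {"dan"},
}

# Score the not-yet-connected pair (bob, dan) as a link-prediction candidate.
print(common_neighbors(adj, "bob", "dan"))
print(jaccard(adj, "bob", "dan"))
print(adamic_adar(adj, "bob", "dan"))
```

All three scores use only the immediate neighborhoods of the two nodes, which keeps them cheap to compute; measures based on paths or random walks would need more of the graph.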

very large data bases | 2014

Supervised meta-blocking

George Papadakis; George Papastefanatos; Georgia Koutrika

Entity Resolution matches mentions of the same entity. Being an expensive task for large data, its performance can be improved by blocking, i.e., grouping similar entities and comparing only entities in the same group. Blocking improves the run-time of Entity Resolution, but it still involves unnecessary comparisons that limit its performance. Meta-blocking is the process of restructuring a block collection in order to prune such comparisons. Existing unsupervised meta-blocking methods use simple pruning rules, which offer a rather coarse-grained filtering technique that can be conservative (i.e., keeping too many unnecessary comparisons) or aggressive (i.e., pruning good comparisons). In this work, we introduce supervised meta-blocking techniques that learn classification models for distinguishing promising comparisons. For this task, we propose a small set of generic features that combine a low extraction cost with high discriminatory power. We show that supervised meta-blocking can achieve high performance with small training sets that can be manually created. We analytically compare our supervised approaches with baseline and competitor methods over 10 large-scale datasets, both real and synthetic.

very large data bases | 2015

Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data

George Papadakis; George Alexiou; George Papastefanatos; Georgia Koutrika

Entity Resolution constitutes a core task for data integration that, due to its quadratic complexity, typically scales to large datasets through blocking methods. These can be configured in two ways. The schema-based configuration relies on schema information in order to select signatures of high distinctiveness and low noise, while the schema-agnostic one treats every token from all attribute values as a signature. The latter approach has significant potential, as it requires no fine-tuning by human experts and it applies to heterogeneous data. Yet, there is no systematic study on its relative performance with respect to the schema-based configuration. This work covers this gap by comparing analytically the two configurations in terms of effectiveness, time efficiency and scalability. We apply them to 9 established blocking methods and to 11 benchmarks of structured data. We provide valuable insights into the internal functionality of the blocking methods with the help of a novel taxonomy. Our studies reveal that the schema-agnostic configuration offers unsupervised and robust definition of blocking keys under versatile settings, trading a higher computational cost for a consistently higher recall than the schema-based one. It also enables the use of state-of-the-art blocking methods without schema knowledge.
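The contrast between the two configurations can be sketched on a toy example. This is an illustration under stated assumptions, not the setup evaluated in the paper: the schema-agnostic side uses plain token blocking (every token of every attribute value is a signature), the schema-based side uses the full value of one chosen attribute as the signature, and the records and function names are hypothetical.

```python
import re
from collections import defaultdict


def token_blocking(records):
    """Schema-agnostic: every token from every attribute value is a signature."""
    blocks = defaultdict(set)
    for rid, attrs in records.items():
        for value in attrs.values():
            for token in re.findall(r"\w+", str(value).lower()):
                blocks[token].add(rid)
    # Blocks with a single record yield no comparisons, so drop them.
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


def schema_based_blocking(records, key_attr):
    """Schema-based (assumed form): the value of one chosen attribute is the signature."""
    blocks = defaultdict(set)
    for rid, attrs in records.items():
        value = attrs.get(key_attr)
        if value:
            blocks[str(value).lower()].add(rid)
    return {key: ids for key, ids in blocks.items() if len(ids) > 1}


# Hypothetical records; r1 and r2 describe the same person.
records = {
    "r1": {"name": "John Smith", "city": "Paris"},
    "r2": {"name": "J. Smith", "city": "paris"},
    "r3": {"name": "Mary Jones", "city": "Athens"},
}

print(token_blocking(records))                 # r1 and r2 co-occur in two blocks
print(schema_based_blocking(records, "name"))  # exact name values differ: no blocks
```

Here token blocking places the matching pair in two blocks (higher recall, but a repeated comparison), while the name-based key misses it entirely, which mirrors the recall versus cost trade-off the abstract reports.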

international conference on data engineering | 2015

LearningAssistant: A novel learning resource recommendation system

Lei Liu; Georgia Koutrika; Shanchan Wu

Reading online content for educational, learning, training or recreational purposes has become a very popular activity. While reading, people may have difficulty understanding a passage or wish to learn more about the topics covered by it, hence they may naturally seek additional or supplementary resources for the particular passage. These resources should be close to the passage both in terms of the subject matter and the reading level. However, using a search engine to find such resources interrupts the reading flow. It is also an inefficient, trial-and-error process because existing web search and recommendation systems do not support large queries, they do not understand semantic topics, and they do not take into account the reading level of the original document a person is reading. In this demo, we present LearningAssistant, a novel system that enables online reading material to be smoothly enriched with additional resources that can supplement or explain any passage from the original material for a reader on demand. The system facilitates the learning process by recommending learning resources (documents, videos, etc.) for selected text passages of any length. The recommended resources are ranked based on two criteria: (a) how they match the different topics covered within the selected passage, and (b) the reading level of the original text where the selected passage comes from. User feedback from students who used our system for their courses in two real pilots, one with a high school and one with a university, suggests that our system is promising and effective.

international conference on data engineering | 2015

Generating reading orders over document collections

Georgia Koutrika; Lei Liu; Steven J. Simske

Given a document collection, existing systems allow users to browse the collection or perform searches that return lists of documents ranked based on their relevance to the user query. While these approaches work fine when a user is trying to locate specific documents, they are insufficient when users need to access the pertinent documents in some logical order, for example for learning or editorial purposes. We present a system that automatically organizes a collection of documents in a tree from general to more specific documents, and allows a user to choose a reading sequence over the documents. This is a novel approach to content consumption that departs from the typical ranked lists of documents based on their relevance to a user query and from static navigational interfaces. We present a set of algorithms that solve the problem and we evaluate their performance as well as the reading trees generated.

IEEE Transactions on Knowledge and Data Engineering | 2014

PrefDB: Supporting Preferences as First-Class Citizens in Relational Databases

Anastasios Arvanitis; Georgia Koutrika

In this paper, we argue that preference-aware query processing needs to be pushed closer to the DBMS. We introduce a preference-aware relational data model that extends database tuples with preferences and an extended algebra that captures the essence of processing queries with preferences. Based on a set of algebraic properties and a cost model that we propose, we provide several query optimization strategies for extended query plans. Further, we describe a query execution algorithm that blends preference evaluation with query execution, while making effective use of the native query engine. We have implemented our framework and methods in a prototype system, PrefDB. PrefDB allows transparent and efficient evaluation of preferential queries on top of a relational DBMS. Our extensive experimental evaluation on two real-world datasets demonstrates the feasibility and advantages of our framework.

engineering interactive computing system | 2015

To print or not to print: hybrid learning with METIS learning platform

Joshua M. Hailpern; Rares Vernica; Molly Bullock; Udi Chatow; Jian Fan; Georgia Koutrika; Jerry Liu; Lei Liu; Steven J. Simske; Shanchan Wu

As part of the explosion in educational software, online tools, and open educational resources, there has been a rapid devaluation of printed textbooks. While digital texts have advantages, printed textbooks still provide irreplaceable value over online media. Therefore, technology should enhance, rather than eliminate, printed text. To this end, this paper presents METIS, a hybrid learning software/service platform that is designed to support active reading. METIS provides easy digital-to-print-to-digital usage, simple creation of Cheat Sheets & FlexNotes for personal note taking and organization, and a custom flexible rendering & publishing engine for education called Aero. METIS was designed based on lessons learned from a formative study of 523 students at SJSU, and validated through focus groups involving 32 educators and students at both high school and college levels.

international conference on data engineering | 2015

Goals in Social Media, information retrieval and intelligent agents

Dimitra Papadimitriou; Yannis Velegrakis; Georgia Koutrika; John Mylopoulos

This tutorial provides a comprehensive and cohesive overview of goal modeling and recognition approaches from the Information Retrieval, Artificial Intelligence, and Social Media communities. We will examine how these fields restrict the domain of study and how they capture notions that are easily perceived by human intuition but difficult to define formally and handle algorithmically. The purpose of this tutorial is to provide a solid framework for placing existing work into perspective and to highlight critical open challenges that will act as a springboard for researchers and practitioners in database systems, social data, and the Web, as well as developers of web-based, database-driven, and social applications, to work towards more user-centric systems and applications.
