Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Neoklis Polyzotis is active.

Publication


Featured research published by Neoklis Polyzotis.


international conference on management of data | 2012

CrowdScreen: algorithms for filtering data with humans

Aditya G. Parameswaran; Hector Garcia-Molina; Hyunjung Park; Neoklis Polyzotis; Aditya Ramesh; Jennifer Widom

Given a large set of data items, we consider the problem of filtering them based on a set of properties that can be verified by humans. This problem is commonplace in crowdsourcing applications, and yet, to our knowledge, no one has considered the formal optimization of this problem. (Typical solutions use heuristics to solve the problem.) We formally state a few different variants of this problem. We develop deterministic and probabilistic algorithms to optimize the expected cost (i.e., number of questions) and expected error. We experimentally show that our algorithms provide definite gains with respect to other strategies. Our algorithms can be applied in a variety of crowdsourcing scenarios and can form an integral part of any query processor that uses human computation.
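The paper's grid-based strategies are not reproduced here, but the cost/error trade-off it optimizes can be illustrated with a minimal sketch: a simple "ladder" strategy keeps asking humans about an item until the yes/no margin reaches a threshold t or a budget of m questions is spent (falling back to majority vote), and its expected cost and error are computed exactly by propagating probability mass over the (yes, no) answer grid. The per-answer error rate, the prior, and the majority-vote fallback are illustrative assumptions, not the paper's models.

```python
def evaluate_strategy(t, m, err, prior):
    """Exact expected cost (questions asked) and error of a margin-t,
    budget-m filtering strategy, for a per-answer error rate `err` and a
    prior probability `prior` that the item truly passes the filter."""
    def run(p_yes):
        # mass[(y, n)] = probability of being undecided with y yes / n no answers
        mass = {(0, 0): 1.0}
        cost = 0.0
        p_pass = 0.0
        for _ in range(m):
            nxt = {}
            for (y, n), p in mass.items():
                cost += p                      # one more question from this state
                for ans, q in ((1, p_yes), (0, 1.0 - p_yes)):
                    y2, n2 = y + ans, n + 1 - ans
                    if y2 - n2 >= t:
                        p_pass += p * q        # margin reached: pass
                    elif n2 - y2 >= t:
                        pass                   # margin reached: fail
                    else:
                        nxt[(y2, n2)] = nxt.get((y2, n2), 0.0) + p * q
            mass = nxt
        for (y, n), p in mass.items():         # budget spent: majority vote
            if y > n:
                p_pass += p
        return cost, p_pass

    cost_t, pass_t = run(1.0 - err)            # item truly passes the filter
    cost_f, pass_f = run(err)                  # item truly does not
    return (prior * cost_t + (1 - prior) * cost_f,
            prior * (1 - pass_t) + (1 - prior) * pass_f)
```

In this sketch, raising t buys accuracy with more questions: with a 20% answer error rate and a 5-question budget, moving from t=1 to t=2 lowers the expected error while raising the expected cost.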


international conference on management of data | 2002

Statistical synopses for graph-structured XML databases

Neoklis Polyzotis; Minos N. Garofalakis

Effective support for XML query languages is becoming increasingly important with the emergence of new applications that access large volumes of XML data. All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows path navigation and branching through the XML data graph in order to reach the desired data elements. Optimizing such queries depends crucially on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over graph-structured XML data. In this paper, we introduce a novel approach to building and using statistical summaries of large XML data graphs for effective path-expression selectivity estimation. Our proposed graph-synopsis model (termed XSKETCH) exploits localized graph stability to accurately approximate (in limited space) the path and branching distribution in the data graph. To estimate the selectivities of complex path expressions over concise XSKETCH synopses, we develop an estimation framework that relies on appropriate statistical (uniformity and independence) assumptions to compensate for the lack of detailed distribution information. Given our estimation framework, we demonstrate that the problem of building an accuracy-optimal XSKETCH for a given amount of space is NP-hard, and propose an efficient heuristic algorithm based on greedy forward selection. Briefly, our algorithm constructs an XSKETCH synopsis by successive refinements of the label-split graph, the coarsest summary of the XML data graph. Our refinement operations act locally and attempt to capture important statistical correlations between data paths. Extensive experimental results with synthetic as well as real-life data sets verify the effectiveness of our approach. To the best of our knowledge, ours is the first work to address this timely problem in the most general setting of graph-structured data and complex (branching) path expressions.
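As a concrete starting point, the label-split graph (the coarsest summary that the greedy algorithm then refines) and a uniformity-based path estimate can be sketched as follows. The node/edge encoding and function names below are illustrative assumptions, not the paper's API.

```python
from collections import Counter

def label_split_synopsis(nodes, edges):
    """Build the label-split graph: one synopsis node per tag, annotated
    with element counts and inter-tag edge counts.
    `nodes` maps element id -> tag; `edges` is a list of (parent, child) ids."""
    counts = Counter(nodes.values())              # tag -> number of elements
    ecounts = Counter()                           # (tag, tag) -> number of edges
    for u, v in edges:
        ecounts[(nodes[u], nodes[v])] += 1
    return counts, ecounts

def estimate_path(counts, ecounts, path):
    """Estimate the selectivity of a simple path t0/t1/... under the
    uniformity assumption: every element of a tag has the same average
    number of children of the next tag."""
    est = float(counts[path[0]])
    for a, b in zip(path, path[1:]):
        if counts[a] == 0:
            return 0.0
        est *= ecounts[(a, b)] / counts[a]        # average fan-out a -> b
    return est
```

On a tiny graph with two `book` elements, one of which has two `author` children, the estimate for `book/author` is 2.0 (exact here, but only an approximation in general once branching correlations appear).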


international conference on management of data | 2004

Approximate XML query answers

Neoklis Polyzotis; Minos N. Garofalakis; Yannis E. Ioannidis

The rapid adoption of XML as the standard for data representation and exchange foreshadows a massive increase in the amounts of XML data collected, maintained, and queried over the Internet or in large corporate data stores. Inevitably, this will result in the development of on-line decision support systems, where users and analysts interactively explore large XML data sets through a declarative query interface (e.g., XQuery or XSLT). Given the importance of remaining interactive, such on-line systems can employ approximate query answers as an effective mechanism for reducing response time and providing users with early feedback. This approach has been successfully used in relational systems and it becomes even more compelling in the XML world, where the evaluation of complex queries over massive tree-structured data is inherently more expensive. In this paper, we initiate a study of approximate query answering techniques for large XML databases. Our approach is based on a novel, conceptually simple, yet very effective XML-summarization mechanism: TREESKETCH synopses. We demonstrate that, unlike earlier techniques focusing solely on selectivity estimation, our TREESKETCH synopses are much more effective in capturing the complete tree structure of the underlying XML database. We propose novel construction algorithms for building TREESKETCH summaries of limited size, and describe schemes for processing general XML twig queries over a concise TREESKETCH in order to produce very fast, approximate tree-structured query answers. To quantify the quality of such approximate answers, we propose a novel, intuitive error metric that captures the quality of the approximation in terms of both the overall structure of the XML tree and the distribution of document edges. Experimental results on real-life and synthetic data sets verify the effectiveness of our TREESKETCH synopses in producing fast, accurate approximate answers and demonstrate their benefits over previously proposed techniques that focus solely on selectivity estimation. In particular, TREESKETCHes yield faster, more accurate approximate answers and selectivity estimates, and are more efficient to construct. To the best of our knowledge, ours is the first work to address the timely problem of producing fast, approximate tree-structured answers for complex XML queries.
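TREESKETCH construction is beyond a short sketch, but the underlying idea of merging structurally identical tree regions into count-annotated synopsis nodes can be illustrated crudely. The tuple encoding of the tree is an assumption, and real count-stable clustering is approximate rather than the exact merging shown here.

```python
def summarize_tree(root):
    """Bottom-up structural compression: subtrees with the same
    (tag, child-summary multiset) signature are merged into a single
    synopsis node whose count records how many data nodes it stands for.
    `root` is a (tag, [children]) tuple."""
    table = {}    # signature -> synopsis node id
    counts = {}   # synopsis node id -> number of merged data nodes

    def visit(node):
        tag, children = node
        sig = (tag, tuple(sorted(visit(c) for c in children)))
        if sig not in table:
            table[sig] = len(table)
        nid = table[sig]
        counts[nid] = counts.get(nid, 0) + 1
        return nid

    visit(root)
    return table, counts
```

A catalog with two identical `book` subtrees compresses from five data nodes to three synopsis nodes, with counts recording the repetition.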


very large data bases | 2002

Structure and value synopses for XML data graphs

Neoklis Polyzotis; Minos N. Garofalakis

All existing proposals for querying XML (e.g., XQuery) rely on a pattern-specification language that allows (1) path navigation and branching through the label structure of the XML data graph, and (2) predicates on the values of specific path/branch nodes, in order to reach the desired data elements. Optimizing such queries depends crucially on the existence of concise synopsis structures that enable accurate compile-time selectivity estimates for complex path expressions over graph-structured XML data. In this paper, we extend our earlier work on structural XSKETCH synopses and we propose an (augmented) XSKETCH synopsis model that exploits localized stability and value-distribution summaries (e.g., histograms) to accurately capture the complex correlation patterns that can exist between and across path structure and element values in the data graph. We develop a systematic XSKETCH estimation framework for complex path expressions with value predicates and we propose an efficient heuristic algorithm based on greedy forward selection for building an effective XSKETCH for a given amount of space (which is, in general, an NP-hard optimization problem). Implementation results with both synthetic and real-life data sets verify the effectiveness of our approach.
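The value-distribution summaries attached to the augmented synopses can be as simple as equi-width histograms. Below is a minimal sketch of building one and estimating a range predicate under the per-bucket uniformity assumption; the function names and half-open value domain are illustrative.

```python
def equiwidth_histogram(values, buckets, lo, hi):
    """Bucket counts over [lo, hi) with `buckets` equal-width buckets."""
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in values:
        counts[min(int((v - lo) / width), buckets - 1)] += 1
    return counts

def estimate_leq(counts, lo, hi, x):
    """Estimate |{v : v <= x}| assuming values spread uniformly in a bucket."""
    width = (hi - lo) / len(counts)
    total = 0.0
    for i, c in enumerate(counts):
        b_lo = lo + i * width
        if x >= b_lo + width:              # bucket entirely below x
            total += c
        elif x > b_lo:                     # bucket partially below x
            total += c * (x - b_lo) / width
    return total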


statistical and scientific database management | 2009

Query Recommendations for Interactive Database Exploration

Gloria Chatzopoulou; Magdalini Eirinaki; Neoklis Polyzotis

Relational database systems are becoming increasingly popular in the scientific community to support the interactive exploration of large volumes of data. In this scenario, users employ a query interface (typically, a web-based client) to issue a series of SQL queries that aim to analyze the data and mine it for interesting information. First-time users, however, may not have the necessary knowledge to know where to start their exploration. Other times, users may simply overlook queries that retrieve important information. To assist users in this context, we draw inspiration from Web recommender systems and propose the use of personalized query recommendations. The idea is to track the querying behavior of each user, identify which parts of the database may be of interest for the corresponding data analysis task, and recommend queries that retrieve relevant data. We discuss the main challenges in this novel application of recommendation systems, and outline a possible solution based on collaborative filtering. Preliminary experimental results on real user traces demonstrate that our framework can generate effective query recommendations.
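A minimal sketch of the collaborative-filtering idea, assuming each past session is summarized as a weighted set of database fragments (e.g., tables or attribute ranges) its queries touched; the session encoding and the cosine measure are illustrative choices, not necessarily the paper's.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def recommend_fragments(active, history, top_n=2):
    """Score fragments by similarity-weighted interest across past sessions,
    skip fragments the active user has already queried, return the top-N."""
    scores = {}
    for session in history:
        sim = cosine(active, session)
        for frag, w in session.items():
            if frag not in active:
                scores[frag] = scores.get(frag, 0.0) + sim * w
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A full system would then map recommended fragments back to concrete SQL queries that retrieve them, which is where the interesting challenges discussed in the paper begin.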


ieee international conference on high performance computing data and analytics | 2011

SciHadoop: array-based query processing in Hadoop

Joe B. Buck; Noah Watkins; Jeff LeFevre; Kleoni Ioannidou; Carlos Maltzahn; Neoklis Polyzotis; Scott A. Brandt

Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly-structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality, and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network.
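Of the optimizations listed, input-partition pruning is the easiest to sketch: skip any partition whose logical array extent cannot intersect the query's subarray. The half-open (inclusive-lo, exclusive-hi) region encoding below is an assumption for illustration, not SciHadoop's actual format.

```python
def overlaps(p_lo, p_hi, q_lo, q_hi):
    """True iff two axis-aligned array regions intersect.
    Each region is given as per-dimension (inclusive lo, exclusive hi)."""
    return all(pl < qh and ql < ph
               for pl, ph, ql, qh in zip(p_lo, p_hi, q_lo, q_hi))

def prune_partitions(partitions, q_lo, q_hi):
    """Keep only the input partitions whose logical extent intersects the
    query subarray -- the rest need never be read from disk."""
    return [p for p in partitions if overlaps(p[0], p[1], q_lo, q_hi)]
```

For a query confined to one corner of the array, most partitions of a regular grid are eliminated before any block scan starts.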


international conference on data engineering | 2007

Supporting Streaming Updates in an Active Data Warehouse

Neoklis Polyzotis; Spiros Skiadopoulos; Panos Vassiliadis; Alkis Simitsis; Nils-Erik Frantzell

Active data warehousing has emerged as an alternative to conventional warehousing practices in order to meet the high demand of applications for up-to-date information. In a nutshell, an active warehouse is refreshed on-line and thus achieves a higher consistency between the stored information and the latest data updates. The need for on-line warehouse refreshment introduces several challenges in the implementation of data warehouse transformations, with respect to their execution time and their overhead to the warehouse processes. In this paper, we focus on a frequently encountered operation in this context, namely, the join of a fast stream S of source updates with a disk-based relation R, under the constraint of limited memory. This operation lies at the core of several common transformations, such as surrogate key assignment, duplicate detection, or identification of newly inserted tuples. We propose a specialized join algorithm, termed mesh join (MeshJoin), that compensates for the difference in the access cost of the two join inputs by (a) relying entirely on fast sequential scans of R, and (b) sharing the I/O cost of accessing R across multiple tuples of S. We detail the MeshJoin algorithm and develop a systematic cost model that enables the tuning of MeshJoin for two objectives: maximizing throughput under a specific memory budget or minimizing memory consumption for a specific throughput. We present an experimental study that validates the performance of MeshJoin on synthetic and real-life data. Our results verify the scalability of MeshJoin to fast streams and large relations, and demonstrate its numerous advantages over existing join algorithms.
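The access pattern at the heart of MeshJoin can be sketched as an in-memory toy: R is scanned cyclically one page at a time, arriving stream batches are buffered, every buffered batch probes each page, and a batch expires once it has met a full cycle of R (so every S tuple meets every R tuple exactly once). Pages and batches are simulated as Python lists, and joining on the first tuple field is an assumed convention.

```python
def mesh_join(stream, relation, n_pages, batch_size):
    # Split R into n_pages chunks ("pages"); the scan below cycles over them.
    chunk = -(-len(relation) // n_pages)           # ceiling division
    r_pages = [relation[i:i + chunk] for i in range(0, len(relation), chunk)]
    n_pages = len(r_pages)
    buffer = []                                    # [stream batch, pages seen]
    out, it, done, page_idx = [], iter(stream), False, 0
    while not done or buffer:
        if not done:                               # admit the next stream batch
            batch = []
            for _ in range(batch_size):
                try:
                    batch.append(next(it))
                except StopIteration:
                    done = True
                    break
            if batch:
                buffer.append([batch, 0])
        page = r_pages[page_idx]                   # one sequential page of R
        page_idx = (page_idx + 1) % n_pages
        for entry in buffer:                       # every batch probes the page
            for s in entry[0]:
                for r in page:
                    if s[0] == r[0]:               # join on the first field
                        out.append((s, r))
            entry[1] += 1
        buffer = [e for e in buffer if e[1] < n_pages]   # full cycle: expire
    return out
```

The real algorithm's cost model then balances the batch size against the memory devoted to buffered stream tuples, which this toy ignores.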


very large data bases | 2009

Autocompletion for mashups

Ohad Greenshpan; Tova Milo; Neoklis Polyzotis

A mashup is a Web application that integrates data, computation and UI elements provided by several components into a single tool. The concept originated from the understanding that there is an increasing number of applications available on the Web and a growing need to combine them in order to meet user requirements. This paper presents MatchUp, a system that supports rapid, on-demand, intuitive development of mashups, based on a novel autocompletion mechanism. The key observation guiding the development of MatchUp is that mashups developed by different users typically share common characteristics; they use similar classes of mashup components and glue them together in a similar manner. MatchUp exploits these similarities to recommend useful completions (missing components and connections between them) for a user's partial mashup specification. The user is presented with a ranking of the recommendations from which she can choose and refine according to her needs. This paper presents the data model and ranking metric underlying our novel autocompletion mechanism. It introduces an efficient top-k ranking algorithm that is at the core of the MatchUp system and that is formally proved to be optimal in some natural sense. We also experimentally demonstrate the efficiency of our algorithm and the effectiveness of our proposal for rapid mashup construction.
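MatchUp's actual data model, ranking metric, and top-k algorithm are not reproduced here; a naive stand-in that scores candidate components by overlap-weighted co-occurrence with the partial mashup in a repository of past mashups conveys the flavor of completion-by-similarity.

```python
from collections import Counter

def top_k_completions(partial, repository, k=3):
    """Suggest components for a partial mashup: every past mashup that
    overlaps the partial specification votes for its remaining components,
    weighted by the size of the overlap; return the k best-scoring ones."""
    partial = set(partial)
    scores = Counter()
    for mashup in repository:
        overlap = len(partial & set(mashup))
        if overlap == 0:
            continue                      # unrelated mashup: no vote
        for comp in set(mashup) - partial:
            scores[comp] += overlap
    return [c for c, _ in scores.most_common(k)]
```

Connections between the suggested components, which MatchUp also recommends, would need a glue-pattern model on top of this component-level counting.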


international world wide web conferences | 2012

Max algorithms in crowdsourcing environments

Petros Venetis; Hector Garcia-Molina; Kerui Huang; Neoklis Polyzotis

Our work investigates the problem of retrieving the maximum item from a set in crowdsourcing environments. We first develop parameterized families of max algorithms that take as input a set of items and output an item from the set that is believed to be the maximum. Such max algorithms could, for instance, select the best Facebook profile that matches a given person or the best photo that describes a given restaurant. Then, we propose strategies that select appropriate max algorithm parameters. Our framework supports various human error and cost models and we consider many of them for our experiments. We evaluate under many metrics, both analytically and via simulations, the tradeoff between three quantities: (1) quality, (2) monetary cost, and (3) execution time. Also, we provide insights on the effectiveness of the strategies in selecting appropriate max algorithm parameters and guidelines for choosing max algorithms and strategies for each application.
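A single-elimination tournament is one natural member of such a parameterized family, with the number of votes per pairwise comparison as its parameter. The error model below (each human comparison independently returns the wrong winner with a fixed probability) is only one of many such models, and the code is a simulation sketch, not the paper's algorithms.

```python
import random

def noisy_max(items, value, err, rounds=1, rng=random):
    """Single-elimination tournament over `items`: each pair is decided by
    a majority of `rounds` human votes, and each vote is independently
    flipped with probability `err`."""
    pool = list(items)
    while len(pool) > 1:
        nxt = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            votes_for_a = sum(
                (value(a) > value(b)) != (rng.random() < err)  # maybe flipped
                for _ in range(rounds)
            )
            nxt.append(a if 2 * votes_for_a > rounds else b)
        if len(pool) % 2:                  # odd one out gets a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]
```

Raising `rounds` trades monetary cost for quality, while the tournament shape itself trades execution time (parallel rounds) against the number of comparisons, which is exactly the three-way tradeoff the paper studies.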


international conference on management of data | 2006

COLT: continuous on-line tuning

Karl Schnaitter; Serge Abiteboul; Tova Milo; Neoklis Polyzotis

The physical schema of a database plays a critical role in performance. Self-tuning is a cost-effective and elegant solution to optimize the physical configuration for the characteristics of the query load. Existing techniques operate in an off-line fashion, by choosing a fixed configuration that is tailored to a subset of the query load. The generated configurations therefore ignore any temporal patterns that may exist in the actual load submitted to the system. This demonstration introduces COLT (Continuous On-Line Tuning), a novel self-tuning framework that continuously monitors the incoming queries and adjusts the system configuration in order to maximize query performance. The key idea behind COLT is to gather performance statistics at different levels of detail and to carefully allocate profiling resources to the most promising candidate configurations. Moreover, COLT uses effective heuristics to regulate its own performance, lowering its overhead when the system is well-tuned, and being more aggressive when the workload shifts and it becomes necessary to re-tune the system. We present a specialization of COLT to the important problem of selecting an effective set of relational indices for the current query load. Our demonstration will use an implementation of our proposed framework in the PostgreSQL database system, showing the internal operation of COLT and the adaptive selection of indices as we vary the query load of the server.
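COLT's multi-level statistics and profiling-resource allocation are not reproduced here, but the continuous-tuning control loop can be caricatured in a few lines: an exponentially decayed benefit counter per candidate index triggers creation once accumulated benefit covers the build cost, and removal once interest fades. All thresholds and the per-query benefit encoding are illustrative assumptions.

```python
def online_tuner(queries, candidates, build_cost, alpha=0.5, drop_below=0.1):
    """Toy continuous-tuning loop: each query reports an estimated saving
    for every candidate index; a decayed benefit counter decides when an
    index pays for its build cost and when it should be dropped.
    `queries` is a sequence of {candidate: estimated_saving} dicts."""
    benefit = {c: 0.0 for c in candidates}
    built = set()
    events = []
    for qi, q in enumerate(queries):
        for c in candidates:
            benefit[c] = alpha * benefit[c] + q.get(c, 0.0)   # decay + reward
            if c not in built and benefit[c] >= build_cost:
                built.add(c)
                events.append(('create', c, qi))
            elif c in built and benefit[c] < drop_below:
                built.discard(c)
                events.append(('drop', c, qi))
    return events
```

The decay factor plays the role of COLT's sensitivity to workload shifts: a smaller `alpha` forgets old queries faster and re-tunes more aggressively.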

Collaboration


Dive into Neoklis Polyzotis's collaborations.

Top Co-Authors

Jeff LeFevre
University of California

Minos N. Garofalakis
Technical University of Crete

Serge Abiteboul
École normale supérieure de Cachan