Yannis Papakonstantinou

Publication

Featured research published by Yannis Papakonstantinou.

extending database technology | 1998

Fusion Queries over Internet Databases

Ramana Yerneni; Yannis Papakonstantinou; Serge Abiteboul; Hector Garcia-Molina

Fusion queries search for information integrated from distributed, autonomous sources over the Internet. We investigate techniques for efficient processing of fusion queries. First, we focus on a very wide class of query plans that capture the spirit of many techniques usually considered in existing systems. We show how to efficiently find good query plans within this large class. We provide additional heuristics that, by considering plans outside our target class of plans, yield further performance improvements.

international conference on management of data | 2017

An Experimental Study of Bitmap Compression vs. Inverted List Compression

Jianguo Wang; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson

Bitmap compression has been studied extensively in the database area and many efficient compression schemes were proposed, e.g., BBC, WAH, EWAH, and Roaring. Inverted list compression is also a well-studied topic in the information retrieval community and many inverted list compression algorithms were developed as well, e.g., VB, PforDelta, GroupVB, Simple8b, and SIMDPforDelta. We observe that they essentially solve the same problem, i.e., how to store a collection of sorted integers with as few as possible bits and support query processing as fast as possible. Due to historical reasons, bitmap compression and inverted list compression were developed as two separated lines of research in the database area and information retrieval area. Thus, a natural question is: Which one is better between bitmap compression and inverted list compression? To answer the question, we present the first comprehensive experimental study to compare a series of 9 bitmap compression methods and 12 inverted list compression methods. We compare these 21 algorithms on synthetic datasets with different distributions (uniform, zipf, and markov) as well as 8 real-life datasets in terms of the space overhead, decompression time, intersection time, and union time. Based on the results, we provide many lessons and guidelines that can be used for practitioners to decide which technique to adopt in future systems and also for researchers to develop new algorithms.
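The shared problem named in the abstract above, storing a sorted integer collection compactly while keeping queries fast, can be illustrated with a minimal sketch. The code below is not taken from the paper. It assumes the textbook form of variable-byte (VB) encoding over gaps between consecutive values, and it uses a plain uncompressed bitmap (a Python integer) only as a point of comparison. The function names are hypothetical.

```python
def vb_encode(sorted_ids):
    """Delta-encode a sorted list of non-negative ints, then VB-encode each gap."""
    out = bytearray()
    prev = 0
    for value in sorted_ids:
        gap = value - prev
        prev = value
        # Emit 7 payload bits per byte, low-order group first.
        # A set high bit means "more bytes of this gap follow".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def vb_decode(data):
    """Invert vb_encode: rebuild each gap, then prefix-sum back to the ids."""
    ids = []
    prev = 0
    gap = 0
    shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7          # continuation byte: keep accumulating
        else:
            prev += gap         # last byte of this gap: emit the id
            ids.append(prev)
            gap = 0
            shift = 0
    return ids


def to_bitmap(sorted_ids):
    """Uncompressed bitmap view of the same set: bit i is set iff i is present."""
    bitmap = 0
    for value in sorted_ids:
        bitmap |= 1 << value
    return bitmap


def bitmap_intersect(a, b):
    """Intersect two bitmaps with one AND, then list the set bit positions."""
    both = a & b
    return [i for i in range(both.bit_length()) if (both >> i) & 1]
```

In this sketch the list representation pays one or more bytes per element, while the bitmap pays one bit per possible value up to the maximum. That is the density trade-off that schemes on both sides try to improve on.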

very large data bases | 2016

HippogriffDB: balancing I/O and GPU bandwidth in big data analytics

Jing Li; Hung-Wei Tseng; Chunbin Lin; Yannis Papakonstantinou; Steven Swanson

As data sets grow and conventional processor performance scaling slows, data analytics move towards heterogeneous architectures that incorporate hardware accelerators (notably GPUs) to continue scaling performance. However, existing GPU-based databases fail to deal with big data applications efficiently: their execution model suffers from scalability limitations on GPUs whose memory capacity is limited; existing systems fail to consider the discrepancy between fast GPUs and slow storage, which can counteract the benefit of GPU accelerators.

In this paper, we propose HippogriffDB, an efficient, scalable GPU-accelerated OLAP system. It tackles the bandwidth discrepancy using compression and an optimized data transfer path. HippogriffDB stores tables in a compressed format and uses the GPU for decompression, trading GPU cycles for the improved I/O bandwidth. To improve the data transfer efficiency, HippogriffDB introduces a peer-to-peer, multi-threaded data transfer mechanism, directly transferring data from the SSD to the GPU. HippogriffDB adopts a query-over-block execution model that provides scalability using a stream-based approach. The model improves kernel efficiency with the operator fusion and double buffering mechanism.

We have implemented HippogriffDB using an NVMe SSD, which talks directly to a commercial GPU. Results on two popular benchmarks demonstrate its scalability and efficiency. HippogriffDB outperforms existing GPU-based databases (YDB) and in-memory data analytics (MonetDB) by 1-2 orders of magnitude.

very large data bases | 2017

MILC: inverted list compression in memory

Jianguo Wang; Chunbin Lin; Ruining He; Moojin Chae; Yannis Papakonstantinou; Steven Swanson

Inverted list compression is a topic that has been studied for 50 years due to its fundamental importance in numerous applications including information retrieval, databases, and graph analytics. Typically, an inverted list compression algorithm is evaluated on its space overhead and query processing time. Earlier list compression designs mainly focused on minimizing the space overhead to reduce expensive disk I/O time in disk-oriented systems. But the recent trend is shifted towards reducing query processing time because the underlying systems tend to be memory-resident. Although there are many highly optimized compression approaches in main memory, there is still a considerable performance gap between query processing over compressed lists and uncompressed lists, which motivates this work.

In this work, we set out to bridge this performance gap for the first time by proposing a new compression scheme, namely, MILC (memory inverted list compression). MILC relies on a series of techniques including offset-oriented fixed-bit encoding, dynamic partitioning, in-block compression, cache-aware optimization, and SIMD acceleration. We conduct experiments on three real-world datasets in information retrieval, databases, and graph analytics to demonstrate the high performance and low space overhead of MILC. We compare MILC with 12 recent compression algorithms and experimentally show that MILC improves the query performance by up to 13.2× and reduces the space overhead by up to 4.7×.
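As a rough illustration of the first technique listed above, offset-oriented fixed-bit encoding, the sketch below stores each block's first value in full and every element as a fixed-width offset from that value. It is a simplified reading of the idea, not MILC itself. It assumes fixed-size blocks in place of dynamic partitioning, omits in-block compression, cache-aware layout, and SIMD, and uses hypothetical names (`Block`, `encode`, `get`, `contains`). Because offsets are relative to the block base and not to the previous element, any element can be read without decoding its predecessors, so membership tests can binary-search the compressed form directly.

```python
import bisect
from dataclasses import dataclass


@dataclass
class Block:
    base: int    # first value of the block, stored uncompressed
    bits: int    # bit width shared by every offset in the block
    count: int   # number of elements in the block
    packed: int  # offsets packed back to back, `bits` bits each


def encode(sorted_ids, block_size=4):
    """Split a sorted list into fixed-size blocks of base-relative offsets."""
    blocks = []
    for start in range(0, len(sorted_ids), block_size):
        chunk = sorted_ids[start:start + block_size]
        base = chunk[0]
        offsets = [value - base for value in chunk]
        # The largest offset (the last one) determines the width for the block.
        bits = max(1, offsets[-1].bit_length())
        packed = 0
        for i, offset in enumerate(offsets):
            packed |= offset << (i * bits)
        blocks.append(Block(base, bits, len(chunk), packed))
    return blocks


def get(block, i):
    """Random access: read the i-th element without touching the others."""
    mask = (1 << block.bits) - 1
    return block.base + ((block.packed >> (i * block.bits)) & mask)


def contains(blocks, target):
    """Binary search over block bases, then inside the chosen block."""
    bases = [block.base for block in blocks]
    idx = bisect.bisect_right(bases, target) - 1
    if idx < 0:
        return False
    block = blocks[idx]
    lo, hi = 0, block.count - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        value = get(block, mid)
        if value == target:
            return True
        if value < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False
```

A gap-based scheme would have to sum all preceding gaps to recover an element. The base-relative layout trades somewhat wider offsets for that direct access.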

international conference on management of data | 2017

Waldo: An Adaptive Human Interface for Crowd Entity Resolution

Vasilis Verroios; Hector Garcia-Molina; Yannis Papakonstantinou

In Entity Resolution, the objective is to find which records of a dataset refer to the same real-world entity. Crowd Entity Resolution uses humans, in addition to machine algorithms, to improve the quality of the outcome. We study a hybrid approach that combines two common interfaces for human tasks in Crowd Entity Resolution, taking into account key observations about the advantages and disadvantages of the two interfaces. We give a formal definition to the problem of human task selection and we derive algorithms with strong optimality guarantees. Our experiments with four real-world datasets show that our hybrid approach gives an improvement of 50% to 300% in the crowd cost to resolve a dataset, compared to using a single interface.

very large data bases | 2016

Fast in-memory SQL analytics on typed graphs

Chunbin Lin; Benjamin Mandel; Yannis Papakonstantinou; Matthias Springer

We study a class of graph analytics SQL queries, which we call relationship queries. These queries involving aggregation, join, semijoin, intersection and selection are a wide superset of fixed-length graph reachability queries and of tree pattern queries. We present real-world OLAP scenarios, where efficient relationship queries are needed. However, row stores, column stores and graph databases are unacceptably slow in such OLAP scenarios.

We propose a GQ-Fast database, which is an indexed database that roughly corresponds to efficient encoding of annotated adjacency lists that combines salient features of column-based organization, indexing and compression. GQ-Fast uses a bottom-up fully pipelined query execution model, which enables (a) aggressive compression (e.g., compressed bitmaps and Huffman) and (b) avoids intermediate results that consist of row IDs (which are typical in column databases). GQ-Fast compiles query plans into executable C++ source code. Besides achieving runtime efficiency, GQ-Fast also reduces main memory requirements because, unlike column databases, GQ-Fast selectively allows dense forms of compression including heavy-weight compressions, which do not support random access.

We used GQ-Fast to accelerate queries for two OLAP dashboards in the biomedical field. GQ-Fast outperforms PostgreSQL by 2-4 orders of magnitude and MonetDB, Vertica and Neo4j by 1-3 orders of magnitude when all of them are running on RAM. Our experiments dissect GQ-Fast's advantage between (i) the use of compiled code, (ii) the bottom-up pipelining execution strategy, and (iii) the use of dense structures. Other analysis and experiments show the space savings of GQ-Fast due to the appropriate use of compression methods. We also show that the runtime penalty incurred by the dense compression methods decreases as the number of CPU cores increases.

international conference on data engineering | 2017

GQFast: Fast Graph Exploration with Context-Aware Autocompletion

Chunbin Lin; Jianguo Wang; Yannis Papakonstantinou

There is an increasing demand to explore similar entities in big graphs. For example, in domains like biomedical science, identifying similar entities may contribute to developing new drugs or discovering new diseases. In this paper, we demonstrate a graph exploration system, called GQFast, which provides a graphical interface to help users efficiently explore similar entities. Methodologically, GQFast first builds efficient indices combining column database optimizations and compression techniques, then it explores similar entities by using the indices. GQFast operates on the real-world PubMed dataset consisting of over 23 million biomedical entities and 1.3 billion relationships. Relying on GQFast's high performance, GQFast provides (i) type-ahead-search to instantly visualize search results while a user is typing a query, and (ii) context-aware query completion to guide users typing queries.

cooperative and human aspects of software engineering | 2017

Big data techniques for public health: a case study

Yannis Katsis; Natasha Balac; Derek A. Chapman; Madhur Kapoor; Jessica Block; William G. Griswold; Jeannie S. Huang; Nikos Koulouris; Massimiliano Menarini; Viswanath Nandigam; Mandy Ngo; Kian Win Ong; Yannis Papakonstantinou; Besa Smith; Konstantinos Zarifis; Steven H. Woolf; Kevin Patrick

Public health researchers increasingly recognize that to advance their field they must grapple with the availability of increasingly large (i.e., thousands of variables) traditional population-level datasets (e.g., electronic medical records), while at the same time integrating additional large datasets (e.g., data on genomics, the microbiome, environmental exposures, socioeconomic factors, and health behaviors). Leveraging these multiple forms of data might well provide unique and unexpected discoveries about the determinants of health and wellbeing. However, we are in the very early stages of advancing the techniques required to understand and analyze big population-level data for public health research. To address this problem, this paper describes how we propose that big data can be efficiently used for public health discoveries. We show that data analytics techniques traditionally employed in public health studies are not up to the task of the data we now have in hand. Instead we present techniques adapted from big data visualization and analytics approaches used in other domains that can be used to answer important public health questions utilizing these existing and new datasets. Our findings are based on an exploratory big data case study carried out in San Diego County, California where we analyzed thousands of variables related to health to gain interesting insights on the determinants of several health outcomes, including life expectancy and anxiety disorders. These findings provide a promising early indication that public health research will benefit from the larger set of activities in contemporary big data research.

Proceedings of the International Symposium on Memory Systems | 2017

Improving SSD lifetime with byte-addressable metadata

Yanqin Jin; Hung-Wei Tseng; Yannis Papakonstantinou; Steven Swanson

Existing solid state drives (SSDs) provide flash-based out-of-band (OOB) data that can only be updated on a page write. Consequently, the metadata stored in their OOB region lack flexibility due to the idiosyncrasies of flash memory, incurring unnecessary flash write operations detrimental to device lifetime. We propose PebbleSSD, an SSD with byte-addressable metadata, or BAM, as a mechanism exploiting the non-volatile, byte-addressable random access memory (NVRAM) inside the SSD. With BAM, PebbleSSD can support a range of useful features to improve its lifetime by reducing redundant flash writes. Specifically, PebbleSSD supports a write-optimized, BAM-based file block mapping to prevent excessive updates of file system index blocks. Furthermore, PebbleSSD allows log-structured file systems to perform fast and efficient log cleaning with minimal flash writes. We have implemented a prototype of PebbleSSD on a commercial SSD development platform, and experimental results demonstrate that PebbleSSD can reduce the amount of data written by log-structured file systems during log cleaning by up to 99%, and PebbleSSD's BAM-based file block mapping can reduce flash writes by up to 33% for a number of workloads.

conference on innovative data systems research | 2015