Nikolaos Papailiou
National Technical University of Athens
Publications
Featured research published by Nikolaos Papailiou.
International World Wide Web Conference | 2012
Nikolaos Papailiou; Ioannis Konstantinou; Dimitrios Tsoumakos; Nectarios Koziris
In this work we present H2RDF, a fully distributed RDF store that combines the MapReduce processing framework with a NoSQL distributed data store. Our system features two unique characteristics that enable efficient processing of both simple and multi-join SPARQL queries on a virtually unlimited number of triples: join algorithms that execute joins according to query selectivity to reduce processing, and an adaptive choice between centralized and distributed (MapReduce-based) join execution for fast query responses. Our system efficiently answers both simple joins and complex multi-join queries and easily scales to 3 billion triples using a small cluster of 9 worker nodes. H2RDF outperforms state-of-the-art distributed solutions on multi-join and non-selective queries while achieving performance comparable to centralized solutions on selective queries. In this demonstration we showcase the system's functionality through an interactive GUI. Users will be able to execute predefined or custom-made SPARQL queries on datasets of different sizes, using different join algorithms. Moreover, they can repeat all queries with a different amount of cluster resources. Using real-time cluster monitoring and detailed statistics, participants will be able to understand the advantages of the different execution schemes for a given input, as well as the scalability of H2RDF with both the data size and the available worker resources.
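The adaptive join dispatch described above can be pictured with a small sketch. The threshold, helper names and cost estimate below are illustrative assumptions, not H2RDF's actual implementation; they only show the idea of routing selective joins to a single node and non-selective ones to MapReduce.

```python
# Hypothetical sketch of selectivity-based join dispatch: selective joins run
# centrally, non-selective ones are shipped to MapReduce. Names, thresholds
# and helpers are assumptions for illustration, not the actual H2RDF code.

SELECTIVITY_THRESHOLD = 100_000  # assumed cutoff on estimated result size


def estimate_join_output(left_count: int, right_count: int) -> int:
    """Crude upper bound on join output size used to pick an execution mode."""
    return min(left_count, right_count)


def choose_execution(left_count: int, right_count: int) -> str:
    """Return the execution mode a planner of this style might choose."""
    if estimate_join_output(left_count, right_count) < SELECTIVITY_THRESHOLD:
        return "centralized"   # small, selective join: run on one node
    return "mapreduce"         # large, non-selective join: distribute it


if __name__ == "__main__":
    print(choose_execution(500, 2_000))              # -> centralized
    print(choose_execution(40_000_000, 75_000_000))  # -> mapreduce
```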
International Conference on Big Data | 2013
Nikolaos Papailiou; Ioannis Konstantinou; Dimitrios Tsoumakos; Panagiotis Karras; Nectarios Koziris
The proliferation of data in RDF format calls for efficient and scalable solutions for its management. While scalability in the era of big data is a hard requirement, modern systems fail to adapt based on the complexity of the query. Current approaches do not scale well when faced with substantially complex, non-selective joins, resulting in exponential growth of execution times. In this work we present H2RDF+, an RDF store that efficiently performs distributed Merge and Sort-Merge joins over a multiple-index scheme. H2RDF+ is highly scalable, utilizing distributed MapReduce processing and HBase indexes. Using aggressive byte-level compression and result grouping over fast scans, it can process both complex and selective join queries in a highly efficient manner. Furthermore, it adaptively chooses either single- or multi-machine execution based on the join complexity estimated from index statistics. Our extensive evaluation demonstrates that H2RDF+ answers non-selective joins an order of magnitude faster than both current state-of-the-art distributed and centralized stores, while being only tenths of a second slower on simple queries, and scales linearly with the amount of available resources.
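Since the joins above operate over sorted index scans, the core Merge join can be pictured as merging two sorted streams. The following is a minimal sketch under that assumption (duplicate-key handling omitted); it is not the system's actual operator.

```python
# Illustrative merge join over two already-sorted streams of (key, value)
# pairs, the kind of sorted scans an index such as an HBase table can serve.
# A simplified sketch, not H2RDF+'s actual join operator.

def merge_join(left, right):
    """Join two iterables of (key, value) pairs sorted by key."""
    left, right = iter(left), iter(right)
    l, r = next(left, None), next(right, None)
    while l is not None and r is not None:
        if l[0] < r[0]:
            l = next(left, None)
        elif l[0] > r[0]:
            r = next(right, None)
        else:
            yield (l[0], l[1], r[1])
            # advance one side; duplicate keys would need buffering in a
            # full implementation, which is omitted here for brevity
            l = next(left, None)


people = [("alice", "type:Person"), ("bob", "type:Person")]
ages = [("alice", 30), ("carol", 41)]
print(list(merge_join(people, ages)))  # [('alice', 'type:Person', 30)]
```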
International Conference on Management of Data | 2014
Nikolaos Papailiou; Dimitrios Tsoumakos; Ioannis Konstantinou; Panagiotis Karras; Nectarios Koziris
The proliferation of data in RDF format has resulted in the emergence of a plethora of specialized management systems. While the ability to adapt to the complexity of a SPARQL query -- given the inherent diversity of such queries -- is crucial, current approaches do not scale well when faced with substantially complex, non-selective joins, resulting in exponential growth of execution times. In this demonstration we present H2RDF+, an RDF store that efficiently performs distributed Merge and Sort-Merge joins using a multiple-index scheme over HBase indexes. Through a greedy planner that incorporates our cost model, it adaptively chooses either single- or multi-machine query execution based on join complexity. In this paper, we present its key scientific contributions and allow participants to interact with an H2RDF+ deployment over a Cloud infrastructure. Using a web-based GUI, users can load different datasets (both real and synthetic), issue any query (custom or predefined) and monitor its execution. By allowing real-time inspection of cluster status, response times and committed resources, the audience can evaluate the validity of H2RDF+'s claims and perform direct comparisons to two other state-of-the-art RDF stores.
International Conference on Management of Data | 2015
Nikolaos Papailiou; Dimitrios Tsoumakos; Panagiotis Karras; Nectarios Koziris
The pace at which data is described, queried and exchanged using the RDF specification has been ever increasing with the proliferation of the Semantic Web. Minimizing SPARQL query response times has been an open issue for the plethora of RDF stores, yet SPARQL result caching techniques have not been extensively utilized. In this work we present a novel system that addresses graph-based, workload-adaptive indexing of large RDF graphs by caching SPARQL query results. At the heart of the system lies a SPARQL query canonical labelling algorithm that is used to uniquely index and reference SPARQL query graphs as well as their isomorphic forms. We integrate our canonical labelling algorithm with a dynamic programming planner in order to generate the optimal join execution plan, examining the utilization of both primitive triple indexes and cached query results. By monitoring cache requests, our system is able to identify and cache SPARQL queries that, even if not explicitly issued, greatly reduce the average response time of a workload. The proposed cache is modular in design, allowing integration with different RDF stores. Incorporating it into an open-source, distributed RDF engine that handles large-scale RDF datasets, we show that workload-adaptive caching can reduce average response times by up to two orders of magnitude and offer interactive response times for complex workloads and huge RDF datasets.
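The role of canonical labelling here is to make isomorphic query graphs hash to the same cache entry. The toy sketch below uses a deliberately naive canonicalisation (renaming variables by order of appearance, then sorting the patterns); it stands in for, and is much weaker than, the paper's actual canonical labelling algorithm.

```python
# Toy cache keyed by a canonical query label, so isomorphic SPARQL basic
# graph patterns reuse the same cached result. The canonicalisation is a
# naive placeholder, not the paper's algorithm.

def canonical_label(triple_patterns):
    """Rename variables by first appearance and sort the patterns."""
    mapping = {}

    def rename(term):
        if term.startswith("?"):
            mapping.setdefault(term, f"?v{len(mapping)}")
            return mapping[term]
        return term

    return tuple(sorted(tuple(rename(t) for t in tp) for tp in triple_patterns))


cache = {}
q1 = [("?x", "worksAt", "?y"), ("?y", "locatedIn", "Athens")]
q2 = [("?a", "worksAt", "?b"), ("?b", "locatedIn", "Athens")]  # isomorphic to q1

cache[canonical_label(q1)] = ["result rows ..."]
print(canonical_label(q2) in cache)  # True: q2 hits q1's cached results
```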
International Conference on Big Data | 2014
Ioannis Giannakopoulos; Nikolaos Papailiou; Christos Mantas; Ioannis Konstantinou; Dimitrios Tsoumakos; Nectarios Koziris
One of the main promises of the cloud computing paradigm is the ability to scale resources on demand. This feature characterizes the cloud era, where the overhead of early expenditure for infrastructure is eliminated. Innovative services are thus able to enter the market more quickly and adapt faster to new challenges and user demand. One of the main aspects of this on-demand nature is the concept of elasticity, i.e., the ability to autonomously provision and de-provision resources in reaction to changes in the incoming load. An elastic service is able to operate at an optimal cost by expanding and contracting its used resources at runtime according to demand. This not only minimizes running cost, but also avoids disruptive outages due to spikes in service usage. While the various layers comprising a cloud service can be scaled, this does not happen in a unified manner. The vision of CELAR is to provide a fully integrated software stack that manages resource allocation for cloud applications in an autonomous, efficient and generic manner. To achieve that, CELAR incorporates novel methodologies for describing cloud applications, monitoring the use of various resources, evaluating cost, taking informed decisions and interacting with the underlying cloud infrastructure. Our goal is two-fold: on the one hand, to develop the methodologies for achieving multi-grained, automatic elasticity control at both the application and infrastructure level; on the other, to develop the open-source tools that implement those methods in an integrated manner. Here we present an overview of the CELAR platform, explaining its architectural components and some basic workflows that show how they interact to deliver the core functionalities.
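The kind of reactive provisioning loop that elasticity control automates can be sketched in a few lines. The thresholds, metric and scaling step below are assumptions chosen for illustration, not CELAR's actual decision logic or interfaces.

```python
# Minimal reactive elasticity decision: compare a monitored metric against
# thresholds and grow or shrink the pool of VMs. Values are illustrative.

SCALE_OUT_CPU = 0.80   # add a VM above 80% average CPU
SCALE_IN_CPU = 0.30    # remove a VM below 30% average CPU


def elasticity_decision(avg_cpu: float, vms: int, min_vms: int = 1) -> int:
    """Return the new number of VMs given the current load."""
    if avg_cpu > SCALE_OUT_CPU:
        return vms + 1
    if avg_cpu < SCALE_IN_CPU and vms > min_vms:
        return vms - 1
    return vms


if __name__ == "__main__":
    print(elasticity_decision(0.92, vms=4))  # 5: load spike, scale out
    print(elasticity_decision(0.12, vms=4))  # 3: idle, contract to cut cost
```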
International Conference on Big Data | 2016
Katerina Doka; Nikolaos Papailiou; Victor Giannakouris; Dimitrios Tsoumakos; Nectarios Koziris
Current platforms fail to efficiently cope with the data and task heterogeneity of modern analytics workflows due to their adherence to a single data and/or compute model. As a remedy, we present IReS, the Intelligent Resource Scheduler for complex analytics workflows executed over multi-engine environments. IReS is able to optimize a workflow with respect to a user-defined policy, relying on cost and performance models of the required tasks over the available platforms. This optimization consists of allocating distinct workflow parts to the most advantageous execution and/or storage engine among the available ones and deciding on the exact amount of resources provisioned. Our current prototype supports 5 compute and 3 data engines, yet new ones can effortlessly be added to IReS by virtue of its engine-agnostic mechanisms. Our extensive experimental evaluation confirms that IReS speeds up diverse and realistic workflows by up to 30% compared to their optimal single-engine plan by automatically scattering parts of them across different execution engines and datastores. Its optimizer incurs only marginal overhead on workflow execution, managing to discover the optimal execution plan within a few seconds, even for large-scale workflow instances.
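The per-task engine selection can be illustrated with a small sketch. The engines, tasks, cost figures and policy names below are invented for illustration and are not the models or engines IReS actually ships with.

```python
# Sketch of policy-driven engine selection: for each workflow task, pick the
# engine whose estimated cost is best under the chosen policy. All numbers
# and engine names are assumed for illustration.

# estimated (runtime_sec, monetary_cost) per (task, engine) -- assumed values
COST_MODEL = {
    ("join", "spark"): (120, 0.50),
    ("join", "postgres"): (300, 0.10),
    ("train_model", "spark"): (600, 2.00),
    ("train_model", "single_node"): (1500, 0.40),
}


def pick_engine(task: str, policy: str = "runtime") -> str:
    """Choose the engine minimising runtime or monetary cost for one task."""
    idx = 0 if policy == "runtime" else 1
    candidates = {e: c for (t, e), c in COST_MODEL.items() if t == task}
    return min(candidates, key=lambda e: candidates[e][idx])


workflow = ["join", "train_model"]
print([pick_engine(t, policy="runtime") for t in workflow])  # ['spark', 'spark']
print([pick_engine(t, policy="cost") for t in workflow])     # ['postgres', 'single_node']
```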
Future Internet | 2014
Elena Demidova; Nicola Barbieri; Stefan Dietze; Adam Funk; Helge Holzmann; Diana Maynard; Nikolaos Papailiou; Wim Peters; Thomas Risse; Dimitris Spiliotopoulos
The web and the social web play an increasingly important role as an information source for Members of Parliament and their assistants, journalists, political analysts and researchers. They provide crucial background information, such as reactions to political events and comments made by the general public. The case study presented in this paper is driven by two European parliaments (the Greek and the Austrian parliament) and targets the effective exploration of political web archives. In this paper, we describe the semantic technologies deployed to ease the exploration of the archived web and social web content and present evaluation results.
IEEE International Conference on Cloud Engineering | 2015
Ioannis Giannakopoulos; Dimitrios Tsoumakos; Nikolaos Papailiou; Nectarios Koziris
In this work we address the problem of predicting the performance of a complex application deployed over virtualized resources. Cloud computing has enabled numerous companies to develop and deploy their applications over cloud infrastructures for a wealth of reasons, including (but not limited to) reduced costs, less administrative effort and the ability to rapidly allocate new resources. Virtualization, however, adds an extra layer to the software stack, making it harder to predict the relation between the allocated resources and the application's performance, which is a key factor for every industry. To address this challenge we propose PANIC, a system that obtains knowledge about the application by actually deploying it over a cloud infrastructure and then approximating its performance for all possible deployment configurations. The user of PANIC defines a set of resources along with their respective ranges; the system then samples the deployment space formed by all combinations of these resources, deploys the application at some representative points and utilizes a wealth of approximation techniques to predict the behavior of the application in the remainder of the space. Our experimental evaluation indicates that a small portion of the possible deployment configurations is enough to create profiles with high accuracy for three real-world applications.
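The profiling idea of sampling a few configurations and approximating the rest can be sketched as follows. The synthetic performance function and the nearest-neighbour approximation are placeholders standing in for real deployments and for the approximation techniques the paper evaluates.

```python
# Sketch of sample-then-approximate profiling: measure ~10% of the deployment
# space and predict unmeasured points from their nearest measured neighbour.
# Both the "measurement" and the approximation method are illustrative.

import itertools
import random


def measure(cores: int, mem_gb: int) -> float:
    """Stand-in for actually deploying the app and measuring throughput."""
    return 100 * cores + 10 * mem_gb - 0.5 * cores * mem_gb


space = list(itertools.product(range(1, 9), range(2, 33, 2)))  # (cores, mem_gb)
random.seed(0)
sampled = random.sample(space, k=len(space) // 10)             # ~10% deployed
profile = {p: measure(*p) for p in sampled}


def predict(point):
    """Approximate an unmeasured configuration from its nearest sample."""
    nearest = min(profile,
                  key=lambda s: (s[0] - point[0]) ** 2 + (s[1] - point[1]) ** 2)
    return profile[nearest]


print(predict((4, 16)), "vs actual", measure(4, 16))
```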
Future Internet | 2014
Thomas Risse; Elena Demidova; Stefan Dietze; Wim Peters; Nikolaos Papailiou; Katerina Doka; Yannis Stavrakas; Vassilis Plachouras; Pierre Senellart; Florent Carpentier; Amin Mantrach; Bogdan Cautis; Patrick Siehndel; Dimitris Spiliotopoulos
The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into "community memories" that aim at building a better understanding of the public view on, e.g., celebrities, court decisions and other events. Due to the size of the Web, the traditional "collect-all" strategy is in many cases not the best method to build Web archives. In this paper, we present the ARCOMEM (From Collect-All Archives to Community Memories) architecture and implementation that uses semantic information, such as entities, topics and events, complemented with information from the Social Web, to guide a novel Web crawler. The resulting archives are automatically enriched with semantic meta-information to ease access and allow retrieval based on conditions that involve high-level concepts.
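Semantically guided crawling of this kind boils down to prioritising frontier URLs by their relevance to the campaign's entities and topics. The scoring function and campaign terms below are simplistic placeholders, not the ARCOMEM crawler's actual logic.

```python
# Toy priority frontier: URLs whose anchor text matches the campaign's terms
# are crawled first. Scoring and terms are illustrative assumptions.

import heapq

CAMPAIGN_TERMS = {"parliament", "election", "court", "ruling"}


def relevance(anchor_text: str) -> float:
    """Fraction of campaign terms appearing in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & CAMPAIGN_TERMS) / len(CAMPAIGN_TERMS)


frontier = []  # max-heap emulated by negating the score
for url, anchor in [
    ("http://example.org/election-results", "parliament election results"),
    ("http://example.org/cat-videos", "funny cat videos"),
]:
    heapq.heappush(frontier, (-relevance(anchor), url))

while frontier:
    score, url = heapq.heappop(frontier)
    print(f"crawl {url} (relevance {-score:.2f})")
```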
International Conference on Big Data | 2016
Victor Giannakouris; Nikolaos Papailiou; Dimitrios Tsoumakos; Nectarios Koziris
Multi-engine analytics has been gaining an increasing amount of attention from both the academic and the industrial community, as it can successfully cope with the heterogeneity and complexity that the plethora of frameworks, technologies and requirements has brought forth. It is now common for a data analyst to combine data that resides on multiple, totally independent engines and perform complex analytics queries. Multi-engine solutions based on SQL can facilitate such efforts, as SQL is a popular standard that the majority of data scientists understand. Existing solutions propose a middleware that centrally optimizes query execution for multiple engines. Yet this approach requires manual integration of every primitive engine operator along with its cost model, making the addition of new operators or engines cumbersome. To address this issue we present MuSQLE, a system for SQL-based analytics over multi-engine environments. MuSQLE can efficiently utilize external SQL engines, allowing for both intra- and inter-engine optimizations. Our framework adopts a novel API-based strategy: instead of manual integration, MuSQLE specifies a generic API, used for cost estimation and query execution, that needs to be implemented for each SQL engine endpoint. Our engine API is integrated with a state-of-the-art query optimizer, adding support for location-aware, multi-engine query optimization and letting individual runtimes perform sub-query physical optimization. The derived multi-engine plans are executed using the Spark distributed execution framework. Our detailed experimental evaluation, integrating PostgreSQL, MemSQL and SparkSQL under MuSQLE, demonstrates its ability to accurately decide on the most suitable execution engine. MuSQLE can provide speedups of up to an order of magnitude for TPC-H queries by leveraging different engines for the execution of individual query parts.
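The shape of a per-engine API of this kind can be sketched as an interface exposing cost estimation and execution, which is all the optimizer needs to talk to. The method names and the mock PostgreSQL endpoint below are assumptions for illustration, not MuSQLE's published API.

```python
# Sketch of a generic per-engine API: each SQL endpoint implements cost
# estimation and execution, and the planner picks the cheapest engine for a
# sub-query. Names and cost values are illustrative assumptions.

from abc import ABC, abstractmethod


class SQLEngine(ABC):
    """Interface an engine endpoint implements so the planner can use it."""

    @abstractmethod
    def estimate_cost(self, sql: str) -> float:
        """Estimated cost of running the (sub)query on this engine."""

    @abstractmethod
    def execute(self, sql: str) -> list:
        """Run the (sub)query and return its rows."""


class FakePostgres(SQLEngine):
    def estimate_cost(self, sql: str) -> float:
        # pretend joins are expensive on this endpoint
        return 250.0 if "JOIN" in sql.upper() else 10.0

    def execute(self, sql: str) -> list:
        return [("row",)]  # placeholder result


engines = {"postgres": FakePostgres()}
subquery = "SELECT * FROM orders"
best = min(engines, key=lambda e: engines[e].estimate_cost(subquery))
print(best, engines[best].execute(subquery))
```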