Herodotos Herodotou
Duke University
Publications
Featured research published by Herodotos Herodotou.
Symposium on Cloud Computing | 2011
Herodotos Herodotou; Fei Dong; Shivnath Babu
Infrastructure-as-a-Service (IaaS) cloud platforms have brought two unprecedented changes to cluster provisioning practices. First, any (nonexpert) user can provision a cluster of any size on the cloud within minutes to run her data-processing jobs. The user can terminate the cluster once her jobs complete, and she needs to pay only for the resources used and the duration of use. Second, cloud platforms enable users to bypass the traditional middleman---the system administrator---in the cluster-provisioning process. These changes give tremendous power to the user, but place a major burden on her shoulders. The user now regularly faces complex cluster sizing problems that involve choosing the cluster size, the type of resources to use from the large number of choices offered by current IaaS cloud platforms, and the job configurations that best meet the performance needs of her workload. In this paper, we introduce the Elastisizer, a system that lets users express cluster sizing problems as declarative queries. The Elastisizer provides reliable answers to these queries through an automated technique that combines job profiling, estimation using black-box and white-box models, and simulation. We have prototyped the Elastisizer for the Hadoop MapReduce framework, and present a comprehensive evaluation that shows the benefits of the Elastisizer in common scenarios where cluster sizing problems arise.
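To make the idea of answering a declarative cluster sizing query concrete, here is a minimal sketch, not the Elastisizer's actual models or API: it enumerates hypothetical instance types and cluster sizes, estimates runtime and cost from a made-up job profile, and returns the cheapest configuration that meets a deadline.

```python
# Illustrative sketch only; the instance types, prices, and the crude
# "work divides evenly over cores" model are all assumptions, not the
# Elastisizer's white-box/black-box estimators.
from dataclasses import dataclass

@dataclass
class InstanceType:
    name: str
    cores: int
    hourly_price: float   # USD per node-hour (hypothetical values)

# Hypothetical profile from one measured run: total work in core-hours.
PROFILED_CORE_HOURS = 40.0

CANDIDATES = [InstanceType("m1.large", 4, 0.35),
              InstanceType("m1.xlarge", 8, 0.70),
              InstanceType("c1.xlarge", 8, 0.66)]

def estimate(instance: InstanceType, nodes: int):
    """Crude estimate: profiled work spreads evenly over all cores."""
    runtime_hours = PROFILED_CORE_HOURS / (instance.cores * nodes)
    cost = runtime_hours * instance.hourly_price * nodes
    return runtime_hours, cost

def cheapest_within_deadline(deadline_hours: float):
    """Answer a query like: cheapest cluster that finishes within the deadline."""
    best = None
    for inst in CANDIDATES:
        for nodes in (2, 4, 8, 16):
            runtime, cost = estimate(inst, nodes)
            if runtime <= deadline_hours and (best is None or cost < best[2]):
                best = (inst.name, nodes, cost, runtime)
    return best

if __name__ == "__main__":
    print(cheapest_within_deadline(2.0))
```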
Very Large Data Bases | 2012
Harold Lim; Herodotos Herodotou; Shivnath Babu
There is a growing trend of performing analysis on large datasets using workflows composed of MapReduce jobs connected through producer-consumer relationships based on data. This trend has spurred the development of a number of interfaces---ranging from program-based to query-based interfaces---for generating MapReduce workflows. Studies have shown that the performance gap between optimized and unoptimized workflows can be quite large. However, automatic cost-based optimization of MapReduce workflows remains a challenge due to the multitude of interfaces, the large size of the execution plan space, and the frequent unavailability of all types of information needed for optimization. We introduce a comprehensive plan space for MapReduce workflows generated by popular workflow generators. We then propose Stubby, a cost-based optimizer that searches selectively through the subspace of the full plan space that can be enumerated correctly and costed based on the information available in any given setting. Stubby enumerates the plan space based on plan-to-plan transformations and an efficient search algorithm. Stubby is designed to be extensible to new interfaces and new types of optimizations, which is a desirable feature given how rapidly MapReduce systems are evolving. Stubby's efficiency and effectiveness have been evaluated using representative workflows from many domains.
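The sketch below shows only the general shape of a transformation-based, cost-driven search over workflow plans; the plan representation, the merge_adjacent_jobs transformation, and the cost function are simplified stand-ins rather than Stubby's actual plan space or cost models.

```python
# Illustrative sketch only: greedy search that applies plan-to-plan
# transformations while they reduce a stand-in cost.
from typing import Callable, List

Plan = List[str]                                  # a workflow plan as job labels
Transformation = Callable[[Plan], List[Plan]]     # plan -> alternative plans

def merge_adjacent_jobs(plan: Plan) -> List[Plan]:
    """Hypothetical transformation: fuse two adjacent jobs into one."""
    alternatives = []
    for i in range(len(plan) - 1):
        fused = plan[:i] + [plan[i] + "+" + plan[i + 1]] + plan[i + 2:]
        alternatives.append(fused)
    return alternatives

def cost(plan: Plan) -> float:
    """Stand-in cost model: fewer jobs means less intermediate materialization."""
    return float(len(plan))

def greedy_optimize(plan: Plan, transformations: List[Transformation]) -> Plan:
    improved = True
    while improved:
        improved = False
        for t in transformations:
            for candidate in t(plan):
                if cost(candidate) < cost(plan):
                    plan, improved = candidate, True
    return plan

print(greedy_optimize(["extract", "aggregate", "join", "report"],
                      [merge_adjacent_jobs]))
```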
International Conference on Management of Data | 2011
Herodotos Herodotou; Nedyalko Borisov; Shivnath Babu
Table partitioning splits a table into smaller parts that can be accessed, stored, and maintained independently of one another. From their traditional use in improving query performance, partitioning strategies have evolved into a powerful mechanism to improve the overall manageability of database systems. Table partitioning simplifies administrative tasks like data loading, removal, backup, statistics maintenance, and storage provisioning. Query language extensions now enable applications and user queries to specify how their results should be partitioned for further use. However, query optimization techniques have not kept pace with the rapid advances in usage and user control of table partitioning. We address this gap by developing new techniques to generate efficient plans for SQL queries involving multiway joins over partitioned tables. Our techniques are designed for easy incorporation into bottom-up query optimizers that are in wide use today. We have prototyped these techniques in the PostgreSQL optimizer. An extensive evaluation shows that our partition-aware optimization techniques, with low optimization overhead, generate plans that can be an order of magnitude better than plans produced by current optimizers.
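A minimal sketch of the intuition behind partition-wise joins follows; the partition ranges and the overlapping_pairs helper are hypothetical and leave out the per-partition statistics and planner integration that the actual techniques rely on.

```python
# Illustrative sketch only: join just the child-partition pairs whose key
# ranges can overlap, instead of joining the two full tables.
Partition = tuple  # (name, low_key, high_key) over the join column

def overlapping_pairs(parts_r, parts_s):
    """Return child-join pairs (r, s) whose key ranges intersect."""
    pairs = []
    for r in parts_r:
        for s in parts_s:
            if r[1] <= s[2] and s[1] <= r[2]:    # ranges overlap
                pairs.append((r[0], s[0]))
    return pairs

# Hypothetical range partitions of tables R and S on the join column.
R = [("r1", 0, 99), ("r2", 100, 199), ("r3", 200, 299)]
S = [("s1", 0, 149), ("s2", 150, 299)]

# Only 4 of the 6 possible child joins are needed; a plan unions their results.
print(overlapping_pairs(R, S))
```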
PLOS ONE | 2011
Faheem Mitha; Herodotos Herodotou; Nedyalko Borisov; Chen Jiang; Josh Yoder; Kouros Owzar
Background: We describe SNPpy, a hybrid script database system using the Python SQLAlchemy library coupled with the PostgreSQL database to manage genotype data from Genome-Wide Association Studies (GWAS). This system makes it possible to merge study data with HapMap data and to merge across studies for meta-analyses, including data filtering based on the values of phenotype and Single-Nucleotide Polymorphism (SNP) data. SNPpy and its dependencies are open source software. Results: The current version of SNPpy offers utility functions to import genotype and annotation data from two commercial platforms. We use these to import data from two GWAS studies and the HapMap Project. We then export these individual datasets to standard data format files that can be imported into statistical software for downstream analyses. Conclusions: By leveraging the power of relational databases, SNPpy offers integrated management and manipulation of genotype and phenotype data from GWAS studies. The analysis of these studies requires merging across GWAS datasets as well as patient and marker selection. To this end, SNPpy enables the user to filter the data and output the results as standardized GWAS file formats. It performs flexible, low-level data validation, including validation of patient data. SNPpy is a practical and extensible solution for investigators who seek to deploy central management of their GWAS data.
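The sketch below shows the general SQLAlchemy pattern such a system might follow, not SNPpy's actual schema or functions; it assumes SQLAlchemy 1.4+ and uses an in-memory SQLite database so it runs standalone, whereas SNPpy itself targets PostgreSQL.

```python
# Illustrative sketch only: hypothetical patient/genotype tables and a
# phenotype-based filter of the kind that precedes export to GWAS formats.
from sqlalchemy import create_engine, Column, Integer, String, ForeignKey
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Patient(Base):
    __tablename__ = "patient"
    id = Column(Integer, primary_key=True)
    phenotype = Column(Integer)              # e.g., 1 = case, 0 = control

class Genotype(Base):
    __tablename__ = "genotype"
    id = Column(Integer, primary_key=True)
    patient_id = Column(Integer, ForeignKey("patient.id"))
    rsid = Column(String)                    # SNP identifier
    call = Column(String)                    # e.g., "AA", "AG", "GG"

engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([Patient(id=1, phenotype=1), Patient(id=2, phenotype=0),
                     Genotype(patient_id=1, rsid="rs12345", call="AG"),
                     Genotype(patient_id=2, rsid="rs12345", call="GG")])
    session.commit()
    # Select genotype calls for cases only.
    cases = (session.query(Genotype.rsid, Genotype.call)
                    .join(Patient, Genotype.patient_id == Patient.id)
                    .filter(Patient.phenotype == 1).all())
    print(cases)
```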
Knowledge Discovery and Data Mining | 2014
Herodotos Herodotou; Bolin Ding; Shobana Balakrishnan; Geoff Outhred; Percy Fitter
Large-scale data center networks are complex---comprising several thousand network devices and several hundred thousand links---and form the critical infrastructure upon which all higher-level services depend. Despite the built-in redundancy in data center networks, performance issues and device or link failures in the network can lead to user-perceived service interruptions. Therefore, determining and localizing user-impacting availability and performance issues in the network in near real time is crucial. Traditionally, both passive and active monitoring approaches have been used for failure localization. However, data from passive monitoring is often too noisy and does not effectively capture silent or gray failures, whereas active monitoring is potent in detecting faults but limited by its scale and granularity in its ability to isolate the exact fault location. Our key idea is to use statistical data mining techniques on large-scale active monitoring data to determine a ranked list of suspect causes, which we refine with passive monitoring signals. In particular, we compute a failure probability for devices and links in near real time using data from active monitoring, and look for statistically significant increases in the failure probability. We also correlate the probabilistic output with other failure signals from passive monitoring to increase the confidence of the probabilistic analysis. We have implemented our approach in the Windows Azure production environment and have validated its effectiveness in terms of localization accuracy, precision, and time to localization using known network incidents over the past three months. The correlated ranked list of devices and links is surfaced as a report that is used by network operators to investigate current issues and identify probable root causes.
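As a rough illustration of the ranking step, the sketch below computes a per-link failure rate from hypothetical active-probe counts and flags statistically significant increases over a baseline with a simple one-sided z-test; the production system's models, and its correlation with passive signals, are considerably richer.

```python
# Illustrative sketch only: probe counts, baselines, and the z-test are
# assumptions used to show the idea of ranking suspect links.
from math import sqrt

def failure_rate(failed: int, total: int) -> float:
    return failed / total if total else 0.0

def z_score(current_failed: int, current_total: int, baseline_rate: float) -> float:
    """One-sided z-test of the current failure rate against the baseline."""
    p = failure_rate(current_failed, current_total)
    se = sqrt(baseline_rate * (1 - baseline_rate) / current_total)
    return (p - baseline_rate) / se if se > 0 else 0.0

# Hypothetical per-link probe data: (failed probes, total probes, baseline rate).
links = {"tor1->agg3": (42, 500, 0.01),
         "agg3->core1": (3, 500, 0.01),
         "tor7->agg2": (6, 500, 0.01)}

# Rank links by how strongly their failure rate deviates from the baseline;
# a z-score around 3 or higher indicates a strong deviation.
suspects = sorted(((name, z_score(f, n, b)) for name, (f, n, b) in links.items()),
                  key=lambda x: -x[1])
for name, z in suspects:
    print(f"{name}: z = {z:.1f}")
```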
Extending Database Technology | 2014
Mostafa Ead; Herodotos Herodotou; Ashraf Aboulnaga; Shivnath Babu
The MapReduce programming model has become widely adopted for large-scale analytics on big data. MapReduce systems such as Hadoop have many tuning parameters, many of which have a significant impact on performance. The map and reduce functions that make up a MapReduce job are developed using arbitrary programming constructs, which makes them black boxes in nature and therefore makes it difficult for users and administrators to make good parameter tuning decisions for a submitted MapReduce job. An approach that is gaining popularity is to provide automatic tuning decisions for submitted MapReduce jobs based on feedback from previously executed jobs. This approach is adopted, for example, by the Starfish system. Starfish and similar systems base their tuning decisions on an execution profile of the MapReduce job being tuned. This execution profile contains summary information about the runtime behavior of the job being tuned, and it is assumed to come from a previous execution of the same job. Managing these execution profiles has not been previously studied. This paper presents PStorM, a profile store and matcher that accurately chooses the relevant profiling information for tuning a submitted MapReduce job from the previously collected profiling information. PStorM can identify accurate tuning profiles even for previously unseen MapReduce jobs. PStorM is currently integrated with the Starfish system, although it can be extended to work with any MapReduce tuning system. Experiments on a large number of MapReduce jobs demonstrate the accuracy and efficiency of profile matching. The results of these experiments show that the profiles returned by PStorM result in tuning decisions that are as good as decisions based on exact profiles collected during previous executions of the tuned jobs. This holds even for previously unseen jobs, which significantly reduces the overhead of feedback-driven, profile-based MapReduce tuning.
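The sketch below conveys the matching idea with a simple nearest-neighbor lookup over hypothetical job feature vectors; PStorM's actual features and matching algorithm differ.

```python
# Illustrative sketch only: the stored profiles, feature choices, and
# Euclidean-distance matching are assumptions for demonstration.
from math import sqrt

def distance(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical feature vectors: (input size in GB, map selectivity, reduce ratio).
profile_store = {
    "wordcount_profile":   (50.0, 0.9, 0.1),
    "join_profile":        (200.0, 1.0, 0.8),
    "aggregation_profile": (80.0, 0.2, 0.05),
}

def best_match(job_features):
    """Return the stored profile closest to the submitted job's features."""
    return min(profile_store.items(),
               key=lambda item: distance(item[1], job_features))[0]

# A newly submitted job with no prior profile of its own.
print(best_match((60.0, 0.85, 0.12)))   # -> "wordcount_profile"
```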
International Conference on Management of Data | 2017
Elena Kakoulli; Herodotos Herodotou
The ever-growing data storage and I/O demands of modern large-scale data analytics are challenging the current distributed storage systems. A promising trend is to exploit the recent improvements in memory, storage media, and networks for sustaining high performance and low cost. While past work explores using memory or SSDs as local storage, or combines local with network-attached storage in cluster computing, this work focuses on managing multiple storage tiers in a distributed setting. We present OctopusFS, a novel distributed file system that is aware of heterogeneous storage media (e.g., memory, SSDs, HDDs, NAS) with different capacities and performance characteristics. The system offers a variety of pluggable policies for automating data management across the storage tiers and cluster nodes. The policies employ multi-objective optimization techniques for making intelligent data management decisions based on the requirements of fault tolerance, data and load balancing, and throughput maximization. At the same time, the storage media are explicitly exposed to users and applications, allowing them to choose the distribution and placement of replicas in the cluster based on their own performance and fault tolerance requirements. Our extensive evaluation shows the immediate benefits of using OctopusFS with data-intensive processing systems, such as Hadoop and Spark, in terms of both increased performance and better cluster utilization.
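To give a flavor of such a placement policy, the following sketch scores hypothetical (node, tier) candidates with a weighted combination of throughput, free capacity, and node load; the metrics, weights, and constraints are illustrative only, not OctopusFS's pluggable policies.

```python
# Illustrative sketch only: rank candidate (node, tier) placements and pick
# the top-k for a block's replicas.
CANDIDATES = [
    # (node, tier, relative throughput, free fraction, current load fraction)
    ("node1", "memory", 1.00, 0.10, 0.70),
    ("node1", "ssd",    0.60, 0.50, 0.70),
    ("node2", "ssd",    0.60, 0.80, 0.30),
    ("node3", "hdd",    0.20, 0.90, 0.20),
]

def score(throughput, free, load, w=(0.5, 0.3, 0.2)):
    """Higher is better: fast tier, plenty of free space, lightly loaded node."""
    return w[0] * throughput + w[1] * free + w[2] * (1 - load)

def place_replicas(k=2):
    ranked = sorted(CANDIDATES, key=lambda c: -score(*c[2:]))
    # A real policy would also enforce fault-tolerance constraints,
    # e.g., never placing two replicas on the same node.
    return [(node, tier) for node, tier, *_ in ranked[:k]]

print(place_replicas())
```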
Mobile Data Management | 2016
Salah Eddin Alshaal; Stylianos Michael; Andreas Pamporis; Herodotos Herodotou; George Samaras; Panayiotis Andreou
The proliferation of wearable and smartphone devices with embedded sensors has enabled researchers and engineers to study and understand user behavior at an extremely high fidelity, particularly for use in industries such as entertainment, health, and retail. However, identified user patterns are yet to be integrated into modern systems with immersive capabilities, such as VR systems, which remain constrained by the limited application interaction models exposed to developers. In this paper, we present Smart VR, a platform that allows developers to seamlessly incorporate user behavior into VR apps. We present the high-level architecture of Smart VR, and show how it facilitates communication, data acquisition, and context recognition between smart wearable devices and mediator systems (e.g., smartphones, tablets, PCs). We demonstrate Smart VR in the context of a VR app for retail stores to show how it can replace cumbersome input devices (e.g., mouse, keyboard) with more natural means of user-app interaction (e.g., user gestures such as swiping and tapping) to improve user experience.
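A tiny sketch of the final dispatch step, mapping recognized gestures to app-level actions, appears below; the GestureDispatcher class and the retail-store actions are hypothetical stand-ins, not Smart VR's API.

```python
# Illustrative sketch only: route gesture events (as they might arrive from a
# mediator device) to handlers, replacing explicit mouse/keyboard input.
from typing import Callable, Dict

class GestureDispatcher:
    def __init__(self):
        self._handlers: Dict[str, Callable[[], None]] = {}

    def on(self, gesture: str, handler: Callable[[], None]) -> None:
        self._handlers[gesture] = handler

    def dispatch(self, gesture: str) -> None:
        handler = self._handlers.get(gesture)
        if handler:
            handler()

# Hypothetical retail-store VR app: swipe browses products, tap selects one.
dispatcher = GestureDispatcher()
dispatcher.on("swipe_left", lambda: print("show next product"))
dispatcher.on("tap", lambda: print("add product to basket"))

for recognized in ["swipe_left", "tap"]:      # events from the mediator device
    dispatcher.dispatch(recognized)
```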
Very Large Data Bases | 2018
Elena Kakoulli; Nikolaos D. Karmiris; Herodotos Herodotou
The continuous improvements in memory, storage devices, and network technologies of commodity hardware introduce new challenges and opportunities in tiered storage management. Whereas past work exploits storage tiers in pairs or for specific applications, OctopusFS---a novel distributed file system that is aware of the underlying storage media---offers a comprehensive solution for managing multiple storage tiers in a distributed setting. OctopusFS contains automated, data-driven policies for managing the placement and retrieval of data across the nodes and storage tiers of the cluster. It also exposes the network locations and storage tiers of the data in order to allow higher-level systems to make locality-aware and tier-aware decisions. This demonstration will showcase the web interface of OctopusFS, which enables users to (i) view detailed utilization information for the various storage tiers and nodes, (ii) browse the directory namespace and perform file-related actions, and (iii) execute caching-related operations while observing their performance impact on MapReduce and Spark workloads.
Archive | 2017
Herodotos Herodotou
The amount of data collected by modern industrial, government, and academic organizations has been increasing exponentially and will continue to grow at an accelerating rate for the foreseeable future. At companies across all industries, servers are overflowing with usage logs, message streams, transaction records, sensor data, business operations records, and mobile device data. Effectively analyzing these massive collections of data (“big data”) can create significant value for the world economy by enhancing productivity, increasing efficiency, and delivering more value to consumers. The need to convert raw data into useful information has led to the development of advanced and unique data storage, management, analysis, and visualization technologies, especially over the last decade. This monograph is an attempt to cover the design principles and core features of systems for analyzing very large datasets for business purposes. In particular, we organize systems into four main categories based on major and distinctive technological innovations. Parallel databases dating back to the 1980s have added techniques like columnar data storage and processing, while new distributed platforms like MapReduce have been developed. Other innovations have aimed at creating alternative system architectures for more generalized dataflow applications. Finally, the growing demand for interactive analytics has led to the emergence of a new class of systems that combine analytical and transactional capabilities.