Publication


Featured research published by Elif Dede.


scientific cloud computing | 2013

Performance evaluation of a MongoDB and hadoop platform for scientific data analysis

Elif Dede; Madhusudhan Govindaraju; Daniel K. Gunter; Richard Shane Canon; Lavanya Ramakrishnan

Scientific facilities such as the Advanced Light Source (ALS) and Joint Genome Institute and projects such as the Materials Project have an increasing need to capture, store, and analyze dynamic semi-structured data and metadata. A similar growth of semi-structured data within large Internet service providers has led to the creation of NoSQL data stores for scalable indexing and MapReduce for scalable parallel analysis. MapReduce and NoSQL stores have been applied to scientific data. Hadoop, the most popular open source implementation of MapReduce, has been evaluated, utilized and modified for addressing the needs of different scientific analysis problems. ALS and the Materials Project are using MongoDB, a document-oriented NoSQL store. However, there is a limited understanding of the performance trade-offs of using these two technologies together. In this paper we evaluate the performance, scalability and fault-tolerance of using MongoDB with Hadoop, towards the goal of identifying the right software environment for scientific data analysis.


grid computing | 2011

Benchmarking MapReduce Implementations for Application Usage Scenarios

Zacharia Fadika; Elif Dede; Madhusudhan Govindaraju; Lavanya Ramakrishnan

The MapReduce paradigm provides a scalable model for large scale data-intensive computing and associated fault-tolerance. With data production increasing daily due to ever growing application needs, scientific endeavors, and consumption, the MapReduce model and its implementations need to be further evaluated, improved, and strengthened. Several MapReduce frameworks with various degrees of conformance to the key tenets of the model are available today, each optimized for specific features. HPC application and middleware developers must thus understand the complex dependencies between MapReduce features and their application. We present a standard benchmark suite for quantifying, comparing, and contrasting the performance of MapReduce platforms under a wide range of representative use cases. We report the performance of three different MapReduce implementations on the benchmarks, and draw conclusions about their current performance characteristics. The three platforms we chose for evaluation are the widely used Apache Hadoop implementation, Twister, which has been discussed in the literature, and LEMO-MR, our own implementation. The performance analysis we perform also sheds light on the available design decisions for future implementations, and allows Grid researchers to choose the MapReduce implementation that best suits their applications' needs.
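The model these benchmarks exercise can be sketched in a few lines of plain Python. This is an illustrative single-process toy, not any of the benchmarked frameworks; the frameworks distribute the same three phases across a cluster:

```python
from collections import defaultdict

# Toy single-process MapReduce word count: illustrates the map,
# shuffle, and reduce phases that frameworks such as Hadoop,
# Twister, and LEMO-MR distribute across many nodes.

def map_phase(records):
    # Emit (key, value) pairs: one (word, 1) pair per token.
    for line in records:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["the"])  # 2
```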


international conference on e-science | 2012

MARISSA: MApReduce Implementation for Streaming Science Applications

Elif Dede; Zacharia Fadika; Jessica Hartog; Madhusudhan Govindaraju; Lavanya Ramakrishnan; D. Gunter; R. Canon

MapReduce has since its inception been steadily gaining ground in various scientific disciplines ranging from space exploration to protein folding. The model poses a challenge for a wide range of current and legacy scientific applications for addressing their “Big Data” challenges. For example: MapReduce's best-known implementation, Apache Hadoop, only offers native support for Java applications. While Hadoop streaming supports applications compiled in a variety of languages such as C, C++, Python and FORTRAN, streaming has been shown to be a less efficient MapReduce alternative in terms of performance and effectiveness. Additionally, Hadoop streaming offers fewer options than its native counterpart, and as such offers less flexibility along with a limited array of features for scientific software. The Hadoop Distributed File System (HDFS), a central pillar of Apache Hadoop, is not a POSIX compliant file system. In this paper, we present an alternative framework to Hadoop streaming to address the needs of scientific applications: MARISSA (MApReduce Implementation for Streaming Science Applications). We describe MARISSA's design and explain how it expands the scientific applications that can benefit from the MapReduce model. We also compare and explain the performance gains of MARISSA over Hadoop streaming.
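The Hadoop Streaming contract the abstract refers to is line-oriented: any executable that reads records from stdin and writes tab-separated key/value pairs to stdout can act as a mapper or reducer. A minimal sketch of that contract (illustrative only; MARISSA's own interface is not shown here):

```python
# Streaming-style word-count mapper and reducer. In a real job these
# would be two standalone executables wired up by Hadoop Streaming;
# here they are plain functions over file-like objects.

def mapper(stdin, stdout):
    # Emit one "word<TAB>1" line per token.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def reducer(stdin, stdout):
    # Streaming reducers receive their input lines sorted by key,
    # so equal keys are adjacent and can be summed in one pass.
    current, total = None, 0
    for line in stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = key, 0
        total += int(value)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")
```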


many task computing on grids and supercomputers | 2011

Riding the elephant: managing ensembles with hadoop

Elif Dede; Madhusudhan Govindaraju; Daniel K. Gunter; Lavanya Ramakrishnan

Many important scientific applications do not fit the traditional model of a monolithic simulation running on thousands of nodes. Scientific workflows -- such as the Materials Genome project, Energy Frontiers Research Center for Gas Separations Relevant to Clean Energy Technologies, climate simulations, and Uncertainty Quantification in fluid and solid dynamics -- all run large numbers of parallel analyses, which we call scientific ensembles. These scientific ensembles have a large number of tasks with control and data dependencies. Current tools for creating and managing these ensembles in HPC environments are limited and difficult to use; this is proving to be a limiting factor to running scientific ensembles at the large scale enabled by these HPC environments. MapReduce and its open-source implementation, Hadoop, is an attractive paradigm due to the simplicity of the programming model and intrinsic mechanisms for handling scalability and fault-tolerance. In this paper, we evaluate the programmability of MapReduce and Hadoop for scientific workflow ensembles.


IEEE Transactions on Services Computing | 2016

Processing Cassandra Datasets with Hadoop-Streaming Based Approaches

Elif Dede; Bedri Sendir; Pinar Kuzlu; J. Weachock; Madhusudhan Govindaraju; Lavanya Ramakrishnan

The progressive transition in the nature of both scientific and industrial datasets has been the driving force behind the development and research interests in the NoSQL model. Loosely structured data poses a challenge to traditional data store systems, and when working with the NoSQL model, these systems are often considered impractical and costly. As the quantity and quality of unstructured data grows, so does the demand for a processing pipeline that is capable of seamlessly combining the NoSQL storage model and a “Big Data” processing platform such as MapReduce. Although MapReduce is the paradigm of choice for data-intensive computing, Java-based frameworks such as Hadoop require users to write MapReduce code in Java, while the Hadoop Streaming module allows users to define non-Java executables as map and reduce operations. When confronted with legacy C/C++ applications and other non-Java executables, there arises a further need to allow NoSQL data stores access to the features of Hadoop Streaming. We present approaches in solving the challenge of integrating NoSQL data stores with MapReduce under non-Java application scenarios, along with advantages and disadvantages of each approach. We compare Hadoop Streaming alongside our own streaming framework, MARISSA, to show performance implications of coupling NoSQL data stores like Cassandra with MapReduce frameworks that normally rely on file-system based data stores. Our experiments also include Hadoop-C*, which is a setup where a Hadoop cluster is co-located with a Cassandra cluster in order to process data using Hadoop with non-Java executables.


Future Generation Computer Systems | 2014

Benchmarking MapReduce implementations under different application scenarios

Elif Dede; Zacharia Fadika; Madhusudhan Govindaraju; Lavanya Ramakrishnan

The MapReduce paradigm provides a scalable model for large scale data-intensive computing and associated fault-tolerance. Data volumes generated and processed by scientific applications are growing rapidly. Several MapReduce implementations, with various degrees of conformance to the key tenets of the model, are available today. Each of these implementations is optimized for specific features. To make the right decisions, HPC application and middleware developers must thus understand the complex dependencies between MapReduce features and their application. We present a set of benchmarks for quantifying, comparing, and contrasting the performance of MapReduce implementations under a wide range of representative use cases. To demonstrate the utility of the benchmarks and to provide a snapshot of the current implementation landscape, we report the performance of three different MapReduce implementations, and draw conclusions about their current performance characteristics. The three implementations we chose for evaluation are the widely used Hadoop implementation, Twister, which has been widely discussed in the literature in the context of scientific applications, and LEMO-MR, our own implementation.


international conference on cloud computing | 2012

Configuring a MapReduce Framework for Dynamic and Efficient Energy Adaptation

Jessica Hartog; Zacharia Fadika; Elif Dede; Madhusudhan Govindaraju

MapReduce has become a popular framework for Big Data applications. While MapReduce has received much praise for its scalability and efficiency, it has not been thoroughly evaluated for power consumption. Our goal with this paper is to explore the possibility of scheduling in a power-efficient manner without the need for expensive power monitors on every node. We begin by considering that no cluster is truly homogeneous with respect to energy consumption. From there we develop a MapReduce framework that can evaluate the current status of each node and dynamically react to estimated power usage. In so doing, we shift work toward more energy-efficient nodes that are currently consuming less power. Our work shows that given an ideal framework configuration, certain nodes may consume only 62.3% of the dynamic power they consumed when the same framework was configured as it would be in a traditional MapReduce implementation.
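The general placement idea described above can be sketched as a greedy scheduler that weights each node's share of work by an estimated energy-efficiency score instead of splitting tasks evenly. This is a hypothetical illustration in the spirit of the paper, not its actual algorithm; the node names and scores are invented, and a real system would derive the scores from runtime power estimates:

```python
# Energy-aware task placement sketch: each task goes to the node
# whose efficiency-weighted load would be lowest after accepting it,
# so more efficient nodes end up with proportionally more work.
# Efficiency scores (work-per-watt estimates) are made up here.

def assign_tasks(tasks, efficiency):
    assignment = {node: [] for node in efficiency}
    for task in tasks:
        node = min(efficiency,
                   key=lambda n: (len(assignment[n]) + 1) / efficiency[n])
        assignment[node].append(task)
    return assignment

# A node estimated to be twice as efficient receives twice the tasks.
plan = assign_tasks(range(9), {"node-fast": 2.0, "node-slow": 1.0})
print(len(plan["node-fast"]), len(plan["node-slow"]))  # 6 3
```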


Future Generation Computer Systems | 2014

MARIANE: Using MApReduce in HPC environments

Zacharia Fadika; Elif Dede; Madhusudhan Govindaraju; Lavanya Ramakrishnan

MapReduce is increasingly becoming a popular programming model. However, the widely used implementation, Apache Hadoop, uses the Hadoop Distributed File System (HDFS), which is currently not directly applicable to a majority of existing HPC environments such as TeraGrid and NERSC that support other distributed file systems. On such resourceful High Performance Computing (HPC) infrastructures, the MapReduce model can rarely make use of full resources, as special circumstances must be created for its adoption, or simply limited resources must be isolated to the same end. This paper not only presents a MapReduce implementation directly suitable for such environments, but also exposes the design choices for better performance gains in those settings. By leveraging inherent distributed file systems’ functions, and abstracting them away from its MapReduce framework, MARIANE (MApReduce Implementation Adapted for HPC Environments) not only allows for the use of the model in an expanding number of HPC environments, but also shows better performance in such settings. This paper identifies the components and trade-offs necessary for this model, and quantifies the performance gains exhibited by our approach in HPC environments over Apache Hadoop in a data intensive setting at the National Energy Research Scientific Computing Center (NERSC).


grid computing | 2011

Scalable and Distributed Processing of Scientific XML Data

Elif Dede; Zacharia Fadika; Chaitali Gupta; Madhusudhan Govindaraju

A seamless and intuitive search capability for the vast amount of datasets generated by scientific experiments is critical to ensure effective use of such data by domain specific scientists. Currently, searches on enormous XML datasets are done manually via custom scripts or by using hard-to-customize queries developed by experts in complex and disparate XML query languages. Such approaches, however, do not provide acceptable performance for large-scale data since they are not based on a scalable distributed solution. Furthermore, it has been shown that databases are not optimized for queries on XML data generated by scientific experiments, as term kinship, range based queries, and constraints such as conjunction and negation need to be taken into account. There exists a critical need for an easy-to-use and scalable framework, specialized for scientific data, that provides natural-language-like syntax along with accurate results. As most existing search tools are designed for exact string matching, which is not adequate for scientific needs, we believe that such a framework will enhance the productivity and quality of scientific research by the data reduction capabilities it can provide. This paper presents how the MapReduce model should be used in XML metadata indexing for scientific datasets, specifically TeraGrid Information Services and the NeXus datasets generated by the Spallation Neutron Source (SNS) scientists. We present an indexing structure that scales well for large-scale MapReduce processing. We present performance results using two MapReduce implementations, Apache Hadoop and LEMO-MR, to emphasize the flexibility and adaptability of our framework in different MapReduce environments.


high performance distributed computing | 2011

Adapting MapReduce for HPC environments

Zacharia Fadika; Elif Dede; Madhusudhan Govindaraju; Lavanya Ramakrishnan

MapReduce is increasingly gaining popularity as a programming model for use in large-scale distributed processing. The model is most widely used when implemented using the Hadoop Distributed File System (HDFS). The use of the HDFS, however, precludes the direct applicability of the model to HPC environments, which use high performance distributed file systems. In such distributed environments, the MapReduce model can rarely make use of full resources, as local disks may not be available for data placement on all the nodes. This work proposes a MapReduce implementation and design choices directly suitable for such HPC environments.

Collaboration


Dive into Elif Dede's collaboration.

Top Co-Authors

Lavanya Ramakrishnan
Lawrence Berkeley National Laboratory

Daniel K. Gunter
Lawrence Berkeley National Laboratory

D. Gunter
University of California