Diana Moise

French Institute for Research in Computer Science and Automation

Publication

Featured research published by Diana Moise.

Journal of Parallel and Distributed Computing | 2011

BlobSeer: Next-generation data management for large scale infrastructures

Bogdan Nicolae; Gabriel Antoniu; Luc Bougé; Diana Moise; Alexandra Carpen-Amarie

As data volumes increase at a high speed in more and more application fields of science, engineering, information services, etc., the challenges posed by data-intensive computing gain increasing importance. The emergence of highly scalable infrastructures, e.g. for cloud computing and for petascale computing and beyond, introduces additional issues for which scalable data management becomes an immediate need. This paper makes several contributions. First, it proposes a set of principles for designing highly scalable distributed storage systems that are optimized for heavy data access concurrency. In particular, we highlight the potentially large benefits of using versioning in this context. Second, based on these principles, we propose a set of versioning algorithms, both for data and metadata, that enable a high throughput under concurrency. Finally, we implement and evaluate these algorithms in the BlobSeer prototype, which we integrate as a storage backend in the Hadoop MapReduce framework. We perform extensive microbenchmarks as well as experiments with real MapReduce applications: they demonstrate that applying the principles defended in our approach brings substantial benefits to data-intensive applications.

International Conference on Multimedia Retrieval | 2013

Indexing and searching 100M images with map-reduce

Diana Moise; Denis Shestakov; Gylfi Thór Gudmundsson; Laurent Amsaleg

Most researchers working on high-dimensional indexing agree on the following three trends: (i) the size of the multimedia collections to index is now reaching millions if not billions of items, (ii) the computers we use every day now come with multiple cores and (iii) hardware becomes more available, thanks to easier access to Grids and/or Clouds. This paper shows how the Map-Reduce paradigm can be applied to indexing algorithms and demonstrates that great scalability can be achieved using Hadoop, a popular Map-Reduce-based framework. Dramatic performance improvements are not, however, guaranteed a priori: such frameworks are rigid, they severely constrain the possible access patterns to data, and RAM, a scarce resource, has to be shared. Furthermore, algorithms require major redesign, and may have to settle for sub-optimal behavior. The benefits, however, are many: simplicity for programmers, automatic distribution, fault tolerance, failure detection and automatic re-runs and, last but not least, scalability. We share our experience of adapting a clustering-based high-dimensional indexing algorithm to the Map-Reduce model, and of testing it at large scale with Hadoop as we index 30 billion SIFT descriptors. We foresee that lessons drawn from our work could minimize time, effort and energy invested by other researchers and practitioners working in similar directions.

International Conference on Big Data | 2013

Terabyte-scale image similarity search: Experience and best practice

Diana Moise; Denis Shestakov; Gylfi Þór Gudmundsson; Laurent Amsaleg

While the past decade has witnessed an unprecedented growth of data generated and collected all over the world, existing data management approaches lack the ability to address the challenges of Big Data. One of the most promising tools for Big Data processing is the MapReduce paradigm. Although it has its limitations, the MapReduce programming model has laid the foundations for answering some of the Big Data challenges. In this paper, we focus on Hadoop, the open-source implementation of the MapReduce paradigm. Using a Hadoop-based application, image similarity search, as a case study, we present our experiences with the Hadoop framework when processing terabytes of data. The scale of the data and the application workload allowed us to test the limits of Hadoop and the efficiency of the tools it provides. We present a wide collection of experiments and the practical lessons we have drawn from our experience with the Hadoop environment. Our findings can be shared as best practices and recommendations to Big Data researchers and practitioners.

International Conference on Data Management in Grid and P2P Systems | 2012

MapReduce Applications in the Cloud: A Cost Evaluation of Computation and Storage

Diana Moise; Alexandra Carpen-Amarie

MapReduce is a powerful paradigm that enables rapid implementation of a wide range of distributed data-intensive applications. The Hadoop project, its main open source implementation, has recently been widely adopted by the Cloud computing community. This paper aims to evaluate the cost of moving MapReduce applications to the Cloud, in order to find a proper trade-off between cost and performance for this class of applications. We provide a cost evaluation of running MapReduce applications in the Cloud, by looking into two aspects: the overhead implied by the execution of MapReduce jobs in the Cloud, compared to an execution on a Grid, and the actual costs of renting the corresponding Cloud resources. For our evaluation, we compared the runtime of 3 MapReduce applications executed with the Hadoop framework, in two environments: 1) on clusters belonging to the Grid’5000 experimental grid testbed and 2) in a Nimbus Cloud deployed on top of Grid’5000 nodes.

IEEE International Symposium on Parallel and Distributed Processing, Workshops and PhD Forum | 2010

Large-scale distributed storage for highly concurrent MapReduce applications

Diana Moise; Gabriel Antoniu; Luc Bougé

A large part of today's most popular applications are data-intensive. Whether they are scientific applications or Internet services, the data volume they process is continuously growing. Two main aspects arise when trying to accommodate the size of the data: processing the computation in a manner that is efficient both in terms of resources and time, and providing storage capable of dealing with the requirements of data-intensive applications. Since the input data is large, the computation, which is, in most cases, straightforward, is distributed across hundreds or thousands of machines; thus, the application is split into tasks that run in parallel on different machines, tasks that will need to access the data in a highly concurrent manner.

International Parallel and Distributed Processing Symposium | 2010