Alexandru Costan
Intelligence and National Security Alliance
Publication
Featured research published by Alexandru Costan.
Computer Physics Communications | 2009
I. Legrand; Harvey B Newman; Ramiro Voicu; Catalin Cirstoiu; C. Grigoras; Ciprian Dobre; Adrian Muraru; Alexandru Costan; M. Dediu; Corina Stratan
The MonALISA (Monitoring Agents in a Large Integrated Services Architecture) framework provides a set of distributed services for monitoring, control, management and global optimization for large-scale distributed systems. It is based on an ensemble of autonomous, multi-threaded, agent-based subsystems which are registered as dynamic services. They can be automatically discovered and used by other services or clients. The distributed agents can collaborate and cooperate in performing a wide range of management, control and global optimization tasks using real-time monitoring information.
complex, intelligent and software intensive systems | 2010
Elvin Sindrilaru; Alexandru Costan; Valentin Cristea
Complex scientific workflows are now commonly executed on global grids. With the increasing scale, complexity, heterogeneity and dynamism of grid environments, the challenges of managing and scheduling these workflows are augmented by dependability issues due to the inherently unreliable nature of large-scale grid infrastructure. In addition to the traditional fault tolerance techniques, specific checkpoint-recovery schemes are needed in current grid workflow management systems to address these reliability challenges. Our research aims to design and develop mechanisms for building an autonomic workflow management system that will exhibit the ability to detect, diagnose, notify, react and recover automatically from failures of workflow execution. In this paper we present the development of a Fault Tolerance and Recovery component that extends the ActiveBPEL workflow engine. The detection mechanism relies on inspecting the messages exchanged between the workflow and the orchestrated Web Services in search of faults. The recovery of a process from a faulted state has been achieved by modifying the default behavior of ActiveBPEL, and it essentially represents a non-intrusive checkpointing mechanism. We present the results of several scenarios that demonstrate the functionality of the Fault Tolerance and Recovery component, outlining an increase in performance of about 50% in comparison to the traditional method of resubmitting the workflow.
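The difference between checkpoint-based recovery and resubmitting a whole workflow can be illustrated with a minimal sketch. This is not the ActiveBPEL component described in the abstract: `run_with_checkpoints`, the in-memory `checkpoint` dictionary, and the idea of modeling workflow steps as plain callables are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def run_with_checkpoints(
    steps: List[Callable[[], object]],
    checkpoint: Dict[int, object],
) -> List[object]:
    """Run steps in order, skipping any step whose result is already stored.

    `checkpoint` maps a step index to its saved result (hypothetical store;
    a real system would persist it). If a step raises, the results of the
    earlier steps stay in `checkpoint`, so a later call resumes from the
    failed step instead of starting over.
    """
    for index, step in enumerate(steps):
        if index in checkpoint:
            continue  # already completed in a previous attempt
        checkpoint[index] = step()  # a failure here propagates to the caller
    return [checkpoint[index] for index in range(len(steps))]
```

Calling the function again with the same `checkpoint` dictionary after a failure re-executes only the failed step and those after it, whereas resubmission would rerun every step from the beginning.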
distributed event-based systems | 2014
Radu Tudoran; Olivier Nano; Ivo Santos; Alexandru Costan; Hakan Soncu; Luc Bougé; Gabriel Antoniu
The easily accessible computation power offered by cloud infrastructures, coupled with the revolution of Big Data, is expanding the scale and speed at which data analysis is performed. In their quest for finding the Value in the 3 Vs of Big Data, applications process larger data sets, within and across clouds. Enabling fast data transfers across geographically distributed sites becomes particularly important for applications which manage continuous streams of events in real time. Scientific applications (e.g. the Ocean Observatory Initiative or the ATLAS experiment) as well as commercial ones (e.g. Microsoft's Bing and Office 365 large-scale services) operate on tens of data-centers around the globe and follow similar patterns: they aggregate monitoring data, assess the QoS or run global data mining queries based on inter-site event stream processing. In this paper, we propose a set of strategies for efficient transfers of events between cloud data-centers and we introduce JetStream: a prototype implementing these strategies as a high-performance batch-based streaming middleware. JetStream is able to self-adapt to the streaming conditions by modeling and monitoring a set of context parameters. It further aggregates the available bandwidth by enabling multi-route streaming across cloud sites. The prototype was validated on tens of nodes from US and Europe data-centers of the Windows Azure cloud using synthetic benchmarks and with application code from the context of the ALICE experiment at CERN. The results show an increase in transfer rate of 250 times over individual event streaming. In addition, introducing an adaptive transfer strategy brings a further 25% gain. Finally, the transfer rate can be tripled thanks to the use of multi-route streaming.
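To see why batch-based streaming can outperform sending events one at a time, consider the following minimal sketch. It is not JetStream's algorithm: the fixed `batch_size`, the function names, and the toy cost model (a fixed overhead per send plus a cost per event) are assumptions chosen for illustration, and the abstract's adaptive sizing and multi-route streaming are not modeled.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_events(events: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Group a stream of events into lists of at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[T] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly shorter, batch
        yield batch


def transfer_time(n_events: int, batch_size: int,
                  per_send_overhead: float, per_event_cost: float) -> float:
    """Toy cost model: every send pays a fixed overhead plus a per-event cost."""
    n_sends = -(-n_events // batch_size)  # ceiling division
    return n_sends * per_send_overhead + n_events * per_event_cost
```

Under this toy model the per-send overhead is amortized over the batch, so larger batches reduce total transfer time at the price of added latency for the first events in each batch. That tension is one reason an adaptive batch size can help.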
ACM Queue | 2009
I. Legrand; Ramiro Voicu; Catalin Cirstoiu; C. Grigoras; L. Betev; Alexandru Costan
MonALISA developers describe how it works, the key design principles behind it, and the biggest technical challenges in building it.
cluster computing and the grid | 2014
Radu Tudoran; Alexandru Costan; Rui Wang; Luc Bougé; Gabriel Antoniu
Today's continuously growing cloud infrastructures provide support for processing ever-increasing amounts of scientific data. Cloud resources for computation and storage are spread among globally distributed datacenters. Thus, to leverage the full computation power of the clouds, global data processing across multiple sites has to be fully enabled. However, managing data across geographically distributed datacenters is not trivial, as it involves high and variable latencies among sites which come at a high monetary cost. In this work, we propose a uniform data management system for scientific applications running across geographically distributed sites. Our solution is environment-aware, as it monitors and models the global cloud infrastructure, and offers predictable data handling performance for transfer cost and time. In terms of efficiency, it provides the applications with the possibility to set a trade-off between money and time and optimizes the transfer strategy accordingly. The system was validated on Microsoft's Azure cloud across the 6 EU and US datacenters. The experiments were conducted on hundreds of nodes using both synthetic benchmarks and the real-life A-Brain application. The results show that our system is able to model and predict the cloud performance well and to leverage this into efficient data dissemination. Our approach reduces the monetary costs and transfer time by up to 3 times.
Proceedings of third international workshop on MapReduce and its Applications | 2012
Radu Tudoran; Alexandru Costan; Gabriel Antoniu
With the emergence of cloud computing as an alternative to supercomputers to support data-intensive applications, MapReduce has arisen as a major programming model for data analysis on clouds. In this context, reduce-intensive algorithms are becoming increasingly useful in applications such as data clustering, classification and mining. However, platforms like MapReduce or Dryad lack built-in support for reduce-intensive workloads. This paper introduces MapIterativeReduce, a framework which 1) extends the MapReduce programming model to better support reduce-intensive applications and 2) substantially improves their efficiency by eliminating the implicit barrier between the Map and the Reduce phase. We evaluated MapIterativeReduce on the Microsoft Azure cloud with synthetic benchmarks and with a real-life application. Compared to state-of-the-art solutions, our approach reduces the execution times by up to 75%.
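One way to picture a reduce-intensive computation that does not wait for a single final reduce step is an iterative reduction over a queue of partial results. The sketch below is sequential and is not the MapIterativeReduce implementation: `iterative_reduce`, the `fan_in` parameter, and the queue-based scheduling are assumptions for illustration. It also assumes that `combine` is associative and commutative, since partial results are merged in an order different from their arrival order.

```python
from collections import deque
from typing import Callable, Deque, Iterable, TypeVar

T = TypeVar("T")


def iterative_reduce(partials: Iterable[T],
                     combine: Callable[[T, T], T],
                     fan_in: int = 2) -> T:
    """Repeatedly merge up to `fan_in` partial results until one remains.

    Each merged value is pushed back onto the queue as a new partial
    result, so reduction proceeds in rounds rather than in one final pass.
    """
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    queue: Deque[T] = deque(partials)
    if not queue:
        raise ValueError("nothing to reduce")
    while len(queue) > 1:
        take = min(fan_in, len(queue))
        merged = queue.popleft()
        for _ in range(take - 1):
            merged = combine(merged, queue.popleft())
        queue.append(merged)  # the merged value becomes a new partial result
    return queue[0]
```

In a distributed setting each merge could run on a different worker as soon as enough partial results are available. This sequential version only shows the data flow.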
Concurrency and Computation: Practice and Experience | 2016
Alexandru Costan; Radu Tudoran; Gabriel Antoniu; Goetz Brasche
The emergence of cloud computing has brought the opportunity to use large‐scale compute infrastructures for a broader and broader spectrum of applications and users. While the cloud paradigm is attractive for the ‘elasticity’ in resource usage and associated costs (the users only pay for resources actually used), cloud applications still suffer from the high latencies and low performance of cloud storage services. As Big Data analysis on clouds becomes more and more relevant in many application areas, enabling high‐throughput massive data processing on cloud data becomes a critical issue, as it impacts the overall application performance. In this paper, we address this challenge at the level of cloud storage. We introduce a concurrency‐optimized data storage system (called TomusBlobs), which federates the virtual disks associated to the Virtual Machines running the application code on the cloud. We demonstrate the performance benefits of our solution for efficient data‐intensive processing by building an optimized prototype MapReduce framework for Microsoft's Azure cloud platform on the basis of TomusBlobs. Finally, we specifically address the limitations of state‐of‐the‐art MapReduce frameworks for reduce‐intensive workloads, by proposing MapIterativeReduce as an extension of the MapReduce model. We validate the aforementioned contributions through large‐scale experiments with synthetic benchmarks and with real‐world applications on the Azure commercial cloud by using resources distributed across multiple data centers; they demonstrate that our solutions bring substantial benefits to data‐intensive applications compared with approaches relying on state‐of‐the‐art cloud object storage.
international conference on cluster computing | 2016
Ovidiu-Cristian Marcu; Alexandru Costan; Gabriel Antoniu; María S. Pérez-Hernández
Big Data analytics has recently gained increasing popularity as a tool to process large amounts of data on-demand. Spark and Flink are two Apache-hosted data analytics frameworks that facilitate the development of multi-step data pipelines using directed acyclic graph patterns. Making the most out of these frameworks is challenging because efficient executions strongly rely on complex parameter configurations and on an in-depth understanding of the underlying architectural choices. Although extensive research has been devoted to improving and evaluating the performance of such analytics frameworks, most studies benchmark the platforms against Hadoop as a baseline, a rather unfair comparison considering the fundamentally different design principles. This paper aims to bring some justice in this respect, by directly evaluating the performance of Spark and Flink. Our goal is to identify and explain the impact of the different architectural choices and the parameter configurations on the perceived end-to-end performance. To this end, we develop a methodology for correlating the parameter settings and the operators' execution plan with the resource usage. We use this methodology to dissect the performance of Spark and Flink with several representative batch and iterative workloads on up to 100 nodes. Our key finding is that neither framework outperforms the other for all data types, sizes and job patterns. This paper performs a fine characterization of the cases when each framework is superior, and we highlight how this performance correlates to operators, to resource usage and to the specifics of the internal framework design.
ieee international conference on cloud computing technology and science | 2013
Gabriel Antoniu; Alexandru Costan; Julien Bigot; Frédéric Desprez; Gilles Fedak; Sylvain Gault; Christian Pérez; Anthony Simonet; Bing Tang; Christophe Blanchet; Raphael Terreux; Luc Bougé; François Briant; Franck Cappello; Kate Keahey; Bogdan Nicolae; Frédéric Suter
As map-reduce emerges as a leading programming paradigm for data-intensive computing, today’s frameworks which support it still have substantial shortcomings that limit its potential scalability. In this paper, we discuss several directions where there is room for such progress: they concern storage efficiency under massive data access concurrency, scheduling, volatility and fault-tolerance. We place our discussion in the perspective of the current evolution towards an increasing integration of large-scale distributed platforms (clouds, cloud federations, enterprise desktop grids, etc.). We propose an approach which aims to overcome the current limitations of existing map-reduce frameworks, in order to achieve scalable, concurrency-optimised, fault-tolerant map-reduce data processing on hybrid infrastructures. This approach will be evaluated with real-life bio-informatics applications on existing Nimbus-powered cloud testbeds interconnected with desktop grids.
Future Generation Computer Systems | 2016
Radu Tudoran; Alexandru Costan; Olivier Nano; Ivo Santos; Hakan Soncu; Gabriel Antoniu
Scientific and commercial applications operate nowadays on tens of cloud datacenters around the globe, following similar patterns: they aggregate monitoring or sensor data, assess the QoS or run global data mining queries based on inter-site event stream processing. Enabling fast data transfers across geographically distributed sites allows such applications to manage the continuous streams of events in real time and quickly react to changes. However, traditional event processing engines often consider data resources as second-class citizens and support access to data only as a side-effect of computation (i.e. they are not concerned with the transfer of events from their source to the processing site). This is an efficient approach as long as the processing is executed in a single cluster where nodes are interconnected by low-latency networks. In a distributed environment, consisting of multiple datacenters, with orders of magnitude differences in capabilities and connected by a WAN, this will undoubtedly lead to significant latency and performance variations. This is precisely the challenge we address in this paper, by proposing JetStream, a high-performance batch-based streaming middleware for efficient transfers of events between cloud datacenters. JetStream is able to self-adapt to the streaming conditions by modeling and monitoring a set of context parameters. It further aggregates the available bandwidth by enabling multi-route streaming across cloud sites, while at the same time optimizing resource utilization and increasing cost efficiency. The prototype was validated on tens of nodes from US and Europe datacenters of the Windows Azure cloud with synthetic benchmarks and a real-life application monitoring the ALICE experiment at CERN. The results show a 3x increase of the transfer rate using the adaptive multi-route streaming, compared to state-of-the-art solutions.