Publications

Featured research published by Andrea Rosà.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015

Predicting and Mitigating Jobs Failures in Big Data Clusters

Andrea Rosà; Lydia Y. Chen; Walter Binder

In large-scale datacenters, software and hardware failures are frequent, resulting in failures of job executions that may cause significant resource waste and performance deterioration. To proactively minimize the resource inefficiency due to job failures, it is important to identify them in advance using key job attributes. However, so far, prevailing research on datacenter workload characterization has overlooked job failures, including their patterns, root causes, and impact. In this paper, we aim to develop prediction models and mitigation policies for unsuccessful jobs, so as to reduce the resource waste in big-data clusters. In particular, we base our analysis on Google cluster traces, consisting of a large number of big-data jobs with a high task fan-out. We first identify the time-varying patterns of failed jobs and the contributing system features. Based on our characterization study, we develop an on-line predictive model for job failures by applying various statistical learning techniques, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR). Furthermore, we propose a delay-based mitigation policy which, after a certain grace period, proactively terminates the execution of jobs that are predicted to fail. The particular objective of postponing job terminations is to strike a good tradeoff between resource waste and false predictions of successful jobs. Our evaluation results show that the proposed method is able to significantly reduce the resource waste, by 41.9% on average, while keeping false terminations of jobs low, i.e., at only 1%.
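To make the delay-based idea concrete, the following sketch (an illustrative reading only; the weights, threshold, and grace period are hypothetical and not taken from the paper) combines a logistic-regression failure predictor with a grace-period termination rule:

```latex
% Illustrative formulation, not the paper's exact model.
% Predicted failure probability of job j from its attribute vector x_j
% (w and b are hypothetical learned parameters):
\[
  p_{\mathrm{fail}}(j) = \sigma\!\left(w^{\top} x_j + b\right),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
\]
% Delay-based mitigation: terminate job j at time t only if its grace period
% T_grace has elapsed and it is still predicted to fail with probability
% above a threshold theta, trading resource waste against false terminations:
\[
  \text{terminate } j \text{ at } t
  \iff t - t_{\mathrm{start}}(j) \ge T_{\mathrm{grace}}
  \;\wedge\; p_{\mathrm{fail}}(j) > \theta
\]
```

Postponing termination until the grace period expires is what keeps false terminations of ultimately successful jobs low.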


Dependable Systems and Networks | 2015

Understanding the Dark Side of Big Data Clusters: An Analysis beyond Failures

Andrea Rosà; Lydia Y. Chen; Walter Binder

Motivated by the high system complexity of today's datacenters, a large body of related studies tries to understand workloads and resource utilization in datacenters. However, there is little work exploring unsuccessful job and task executions. In this paper, we study three types of unsuccessful executions in traces of a Google datacenter, namely fail, kill, and eviction. The objective of our analysis is to identify their resource waste, their impact on application performance, and their root causes. We first quantitatively show their strong negative impact on CPU, RAM, and disk usage and on task slowdown. We then analyze patterns of unsuccessful jobs and tasks, particularly focusing on their interdependency. Moreover, we uncover their root causes by inspecting key workload and system attributes such as machine locality and concurrency level. Our results help in the design of low-latency and fault-tolerant big-data systems.


Measurement and Modeling of Computer Systems | 2015

Demystifying Casualties of Evictions in Big Data Priority Scheduling

Andrea Rosà; Lydia Y. Chen; Robert Birke; Walter Binder

The ever-increasing size and complexity of large-scale datacenters make it difficult to develop efficient scheduling policies for big-data systems, where priority scheduling is often employed to guarantee the allocation of system resources to high-priority tasks, at the cost of task preemption and the resulting resource waste. A large number of related studies focus on understanding workloads and their performance impact on such systems; nevertheless, existing work pays little attention to evicted tasks, their characteristics, and the resulting impairment of system performance. In this paper, we base our analysis on Google cluster traces, where tasks can experience three different types of unsuccessful events, namely eviction, kill, and fail. We particularly focus on eviction events, i.e., the preemption of task execution due to higher-priority tasks, and rigorously quantify their performance drawbacks in terms of wasted machine time and resources, with particular focus on priority. Motivated by the strong dependency of eviction on the underlying scheduling policies, we also study its statistical patterns and its dependency on other types of unsuccessful events. Moreover, by considering co-executed tasks and system load, we deepen our knowledge of priority scheduling, showing how priority and machine utilization affect the eviction process and the related tasks.


Annual Erlang Workshop | 2016

Profiling actor utilization and communication in Akka

Andrea Rosà; Lydia Y. Chen; Walter Binder

Several programming languages and frameworks offer actor-based concurrency inspired by Erlang. Among them, Akka has been adopted in numerous applications and frameworks running on the Java Virtual Machine. Unfortunately, despite the spread of Akka-based applications, there are few dedicated profilers. In this paper, we aim at filling this gap by presenting a novel profiling tool for Akka applications. In contrast to existing profilers for Akka, our tool focuses particularly on actor utilization and on the communication between actors. We evaluate our tool on various applications and frameworks in both parallel and distributed settings, such as Signal/Collect, Apache Spark, and Apache Flink. Our results show that our profiler helps in understanding actor utilization, investigating load balancing in computing frameworks, and analyzing communication performance in the message exchange process.
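As a minimal sketch of the per-actor metrics such a profiler targets (assuming the Akka classic Java API; the class and field names below are hypothetical and do not come from the paper's tool), consider an actor that records how many messages it processes and how long it spends processing them:

```java
import akka.actor.AbstractActor;

// Hypothetical example: an actor that tracks its own message traffic and
// processing time, approximating the utilization and communication metrics
// a profiler could report per actor.
public class InstrumentedWorker extends AbstractActor {
    private long messagesReceived = 0;  // messages handled by this actor
    private long processingNanos = 0;   // time spent handling them

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .matchAny(msg -> {
                long start = System.nanoTime();
                // ... application-specific message handling would go here ...
                messagesReceived++;
                processingNanos += System.nanoTime() - start;
            })
            .build();
    }
}
```

The actual tool collects such metrics for unmodified applications; the instrumentation-based technique behind it is described in the SIGPLAN Notices entry below.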


International Workshop on Quality of Service | 2015

Catching failures of failures at big-data clusters: A two-level neural network approach

Andrea Rosà; Lydia Y. Chen; Walter Binder

Big-data applications are becoming the core of today's business operations, featuring complex data structures and high task fan-out. According to the publicly available Google trace, more than 40% of big-data jobs do not reach successful completion. Interestingly, a significant portion of the tasks of such failed jobs undergo multiple types of repetitive failed executions and consume a non-negligible amount of resources. To conserve resources in big-data clusters, it is imperative to capture such failed tasks of failed jobs, a very challenging problem due to the multiple types of failures associated with tasks and the highly uneven task distribution. In this paper, we develop an on-line two-level Neural Network (NN) model which can accurately untangle the complex dependencies among tasks and jobs, and predict their execution classes in an extremely dynamic and heterogeneous system. Our proposed NN model first predicts the job class, and then the three classes of failed tasks of failed jobs, based on a sliding learning window. Furthermore, we develop resource conservation policies that terminate failed tasks of failed jobs after a grace period derived from prediction confidences and task execution times. Overall, evaluating our approach on a Google cluster trace, we are able to accurately capture failures of failures in big-data clusters, limit false negatives to 1%, and efficiently save system resources, achieving significant reductions of CPU, memory, and disk consumption of up to 49%.
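Schematically (an illustrative reading of the abstract, not the paper's exact architecture), the two levels can be written as:

```latex
% First level: classify job j from its features x_j^{job};
% second level: classify task i of a predicted-failed job into one of the
% three failure classes, using task features and the job-level prediction.
\[
  \hat{y}^{\mathrm{job}}_j = f^{(1)}_{\mathrm{NN}}\!\left(x^{\mathrm{job}}_j\right),
  \qquad
  \hat{y}^{\mathrm{task}}_i = f^{(2)}_{\mathrm{NN}}\!\left(x^{\mathrm{task}}_i,\, \hat{y}^{\mathrm{job}}_j\right)
\]
```

with both networks updated over the sliding learning window mentioned above.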


SIGPLAN Notices | 2016

Actor profiling in virtual execution environments

Andrea Rosà; Lydia Y. Chen; Walter Binder

Nowadays, many virtual execution environments benefit from the concurrency offered by the actor model. Unfortunately, while actors are used in many applications, existing profiling tools are not very effective at analyzing the performance of applications using actors. In this paper, we present a new instrumentation-based technique to profile actors in virtual execution environments. Our technique adopts platform-independent profiling metrics that minimize the perturbations induced by the instrumentation logic and allow comparing profiling results across different platforms. In particular, our technique measures the initialization cost, the amount of executed computations, and the messages sent and received by each actor. We implement our technique within a profiling tool for Akka actors on the Java platform. Evaluation results show that our profiling technique aids the performance analysis of actor utilization and of the communication between actors in large-scale computing frameworks.


Symposium on Code Generation and Optimization | 2018

Analyzing and optimizing task granularity on the JVM

Andrea Rosà; Eduardo Rosales; Walter Binder

Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, resulting in missed parallelization opportunities. In this paper, we provide a better understanding of task granularity for applications running on a Java Virtual Machine. We present a novel profiler which measures the granularity of every executed task. Our profiler collects carefully selected metrics from the whole system stack with little overhead, and helps the developer locate performance problems. We analyze task granularity in the DaCapo and ScalaBench benchmark suites, revealing several inefficiencies related to fine-grained and coarse-grained tasks. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in two benchmarks, achieving speedups of up to 1.53x.
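The granularity trade-off the profiler exposes can be illustrated with a standard fork/join computation (a self-contained toy example; the threshold value is hypothetical and unrelated to the paper's benchmarks):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Summing an array with fork/join tasks. THRESHOLD controls task granularity:
// a tiny value spawns many fine-grained tasks (high scheduling overhead), a
// huge value yields few coarse-grained tasks that may leave cores idle.
public class SumTask extends RecursiveTask<Long> {
    static final int THRESHOLD = 10_000;  // hypothetical granularity knob
    private final long[] data;
    private final int from, to;

    SumTask(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {       // coarse enough: compute sequentially
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;        // otherwise split into two subtasks
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();
        return right.compute() + left.join();
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Arrays.fill(data, 1L);
        long sum = ForkJoinPool.commonPool().invoke(new SumTask(data, 0, data.length));
        System.out.println(sum);            // prints 1000000
    }
}
```

Choosing THRESHOLD well is exactly the kind of decision that per-task granularity profiles make actionable.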


Runtime Verification | 2016

Extended Code Coverage for AspectJ-Based Runtime Verification Tools

Omar Javed; Yudi Zheng; Andrea Rosà; Haiyang Sun; Walter Binder

Many runtime verification tools for the Java virtual machine rely on aspect-oriented programming, particularly on AspectJ, to weave the verification logic into the observed program. However, AspectJ imposes several limitations on the verification tools, such as a restricted join point model and the inability to weave certain classes, particularly those of the Java and Android class libraries. In this paper, we show that our domain-specific aspect language DiSL can overcome these limitations. While offering a programming model akin to AspectJ, DiSL features an extensible join point model and ensures weaving with complete bytecode coverage for Java and Android. We present a new compiler that translates runtime-verification aspects written in AspectJ to DiSL. Hence, it is possible to use existing, unmodified runtime verification tools on top of the DiSL framework to bypass the limitations of AspectJ. As a case study, we show that the AspectJ-based runtime verification tool JavaMOP significantly benefits from the automated translation of AspectJ to DiSL code, gaining increased code coverage. Thanks to DiSL, JavaMOP analyses are able to unveil violations in the Java class library that cannot be detected when using AspectJ.
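For context, runtime-verification logic is typically woven as AspectJ advice similar to the following (a hypothetical monitor, not taken from JavaMOP). With a standard AspectJ weaver, only call sites in woven application code are observed; events originating inside the unwoven Java class library are missed, which is the coverage gap that the translation to DiSL closes:

```java
// Hypothetical AspectJ aspect that logs additions to collections at call sites.
// Calls made from within java.* library code itself are not visible to AspectJ,
// since the class library is not woven.
public aspect CollectionAddMonitor {
    before(java.util.Collection c):
            call(* java.util.Collection+.add(..)) && target(c) {
        System.err.println("add() on " + c.getClass().getName()
                + " at " + thisJoinPointStaticPart.getSourceLocation());
    }
}
```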


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015

Understanding Unsuccessful Executions in Big-Data Systems

Andrea Rosà; Lydia Y. Chen; Walter Binder

Big-data applications are being increasingly used in today's large-scale datacenters for a large variety of purposes, such as solving scientific problems, running enterprise services, and computing data-intensive tasks. Due to the growing scale of these systems and the complexity of the running applications, jobs in big-data systems experience unsuccessful terminations of different kinds. While a large body of existing studies sheds light on failures occurring in large-scale datacenters, the current literature overlooks the characteristics and the performance impairment of a broader class of unsuccessful executions, which can arise due to application failures, dependency violations, machine constraints, job kills, and task preemption. Nonetheless, deepening our understanding in this field is of paramount importance, as unsuccessful executions can lower user satisfaction, impair reliability, and lead to high resource waste. In this paper, we describe the problem of unsuccessful executions in big-data systems, and highlight the critical importance of improving our knowledge of this subject. We review the existing literature in this field, discuss its limitations, and present our own contributions to the problem, along with our research plan for the future.


Conference Companion of the 2nd International Conference on Art, Science, and Engineering of Programming | 2018

Understanding task granularity on the JVM: profiling, analysis, and optimization

Andrea Rosà; Eduardo Rosales; Filippo Schiavio; Walter Binder

Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, resulting in missed parallelization opportunities. We focus on task-parallel applications running in a single Java Virtual Machine on a shared-memory multicore. Although their performance may depend considerably on the granularity of their tasks, the related analyses and optimizations have received little attention in the literature. In this paper, we advocate the need for a better understanding of task granularity for such applications. We discuss the importance of improving our knowledge on this topic, and highlight the related challenges. We present new approaches to profile, analyze, and optimize task granularity, and discuss the results obtained so far.
