Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Lydia Y. Chen is active.

Publication


Featured research published by Lydia Y. Chen.


International Conference on Cloud Computing | 2012

Opportunistic Service Provisioning in the Cloud

Mathias Bjorkqvist; Lydia Y. Chen; Walter Binder

There is an emerging trend to deploy services in cloud environments due to their flexibility in providing virtual capacity and pay-as-you-go billing features. Cost-aware services demand computation capacity such as virtual machines (VMs) from a cloud operator according to the workload (i.e., service invocations) and pay for the amount of capacity used following billing contracts. However, as recent empirical studies show, the performance variability, i.e., non-uniform VM performance, is inherently higher than in private hosting platforms, since cloud platforms provide VMs running on top of typically heterogeneous hardware shared by multiple clients. Consequently, the provisioning of service capacity in a cloud needs to consider workload variability as well as varying VM performance. We propose an opportunistic service replication policy that leverages the variability in VM performance, as well as the on-demand billing features of the cloud. Our objective is to minimize the service provisioning costs by keeping a lower number of faster VMs, while maintaining target system utilization. Our evaluation results on traces collected from in-production systems show that the proposed policy achieves significant cost savings and low response times.
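The core idea, keeping a smaller set of faster VMs while holding utilization at a target, can be sketched as follows (the function and its parameters are illustrative, not taken from the paper):

```python
def select_vms(service_rates, arrival_rate, target_util):
    """Greedily keep the fewest, fastest VMs whose combined capacity
    holds utilization at or below the target.

    service_rates: measured per-VM throughput (requests/sec)
    arrival_rate:  current workload (requests/sec)
    target_util:   desired system utilization, e.g. 0.9
    """
    kept, capacity = [], 0.0
    for rate in sorted(service_rates, reverse=True):  # fastest VMs first
        if capacity > 0 and arrival_rate / capacity <= target_util:
            break  # target already met; release the remaining (slower) VMs
        kept.append(rate)
        capacity += rate
    return kept

# With 100 req/s offered and a 0.9 utilization target, the slowest VM
# can be released:
select_vms([50, 40, 30, 20], 100, 0.9)  # -> [50, 40, 30]
```

Under pay-as-you-go billing, releasing the slower VMs directly reduces cost while the faster ones absorb the load.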


Dependable Systems and Networks | 2013

State-of-the-practice in data center virtualization: Toward a better understanding of VM usage

Robert Birke; Andrej Podzimek; Lydia Y. Chen; Evgenia Smirni

Hardware virtualization is the prevalent way to share data centers among different tenants. In this paper we present a large-scale workload characterization study that aims at a better understanding of the state-of-the-practice, i.e., how data centers in the private cloud are used by their customers, how physical resources are shared among different tenants using virtualization, and how virtualization technologies are actually employed. Our study covers all corporate data centers of a major infrastructure provider, geographically dispersed across the entire globe, and reports on their observed usage across a 19-day period. We especially focus on how virtual machines are deployed across different physical resources with an emphasis on processors and memory, focusing on resource sharing and usage of physical resources, virtual machine life cycles, and migration patterns and frequencies. Our study illustrates a strong tendency toward over-provisioning resources while remaining conservative about the possibilities opened up by virtualization (e.g., migration and co-location), showing tremendous potential for the development of policies aiming to reduce data center operational costs.


International Conference on Cloud Computing | 2012

Data Centers in the Cloud: A Large Scale Performance Study

Robert Birke; Lydia Y. Chen; Evgenia Smirni

With the advancement of virtualization technologies and the benefit of economies of scale, industries are seeking scalable IT solutions, such as data centers hosted either in-house or by a third party. Data center availability, often via a cloud setting, is ubiquitous. Nonetheless, little is known about the in-production performance of data centers, and especially the interaction of workload demands and resource availability. This study fills this gap by conducting a large scale survey of in-production data center servers within a time period that spans two years. We provide in-depth analysis on the time evolution of existing data center demands by providing a holistic characterization of typical data center server workloads, focusing on their basic resource components, including CPU, memory, and storage systems. We especially focus on seasonality of resource demands and how this is affected by different geographical locations. This survey provides a glimpse of the evolution of data center workloads and provides a basis for an economic analysis that can be used for effective capacity planning of future data centers.


International Conference on Computer Communications | 2011

Minimizing retrieval latency for content cloud

Mathias Bjorkqvist; Lydia Y. Chen; Marko Vukolić; Xi Zhang

Content cloud systems, e.g. CloudFront [1] and CloudBurst [2], in which content items are retrieved by end-users from the edge nodes of the cloud, are becoming increasingly popular. The retrieval latency in content clouds depends on content availability in the edge nodes, which in turn depends on the caching policy at the edge nodes. In case of local content unavailability (i.e., a cache miss), edge nodes resort to source selection strategies to retrieve the content items either vertically from the central server, or horizontally from other edge nodes. Consequently, managing the latency in content clouds needs to take into account several interrelated issues: asymmetric bandwidth and caching capacity for both source types as well as edge node heterogeneity in terms of caching policies and source selection strategies applied. In this paper, we study the problem of minimizing the retrieval latency considering both caching and retrieval capacity of the edge nodes and server simultaneously. We derive analytical models to evaluate the content retrieval latency under two source selection strategies, i.e., Random and Shortest-Queue, and three caching policies: selfish, collective, and a novel caching policy that we call the adaptive caching policy. Our analysis allows the quantification of the interrelated performance impacts of caching and retrieval capacity and the exploration of the corresponding design space. In particular, we show that the adaptive caching policy combined with Shortest-Queue selection scales well with various network configurations and adapts to the load changes in our simulation and analytical results.
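The Shortest-Queue source selection strategy studied here is simple to state: on a cache miss, retrieve the content item from whichever eligible source currently has the fewest queued requests. A minimal sketch (the source names are illustrative):

```python
def pick_source(queue_lengths):
    """Shortest-Queue selection: return the source (central server or a
    peer edge node) with the fewest queued retrieval requests.

    queue_lengths: mapping of source name -> current queue length
    """
    return min(queue_lengths, key=queue_lengths.get)

# On a cache miss, an edge node chooses among the eligible sources:
pick_source({"central_server": 5, "edge_a": 2, "edge_b": 7})  # -> "edge_a"
```

Random selection would instead draw a source uniformly, ignoring load; the paper's analysis quantifies how much the load-aware choice helps under asymmetric bandwidth and caching capacity.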


High Performance Distributed Computing | 2012

Achieving application-centric performance targets via consolidation on multicores: myth or reality?

Lydia Y. Chen; Danilo Ansaloni; Evgenia Smirni; Akira Yokokawa; Walter Binder

Consolidation of multiple applications with diverse and changing resource requirements is common in multicore systems as hardware resources are abundant and opportunities for better system usage are plenty. Can we maximize resource usage in such a system while respecting individual application performance targets or is it an oxymoron to simultaneously meet such conflicting measures? In this work we provide a solution to the above difficult problem by constructing a queueing-theory based tool that we use to accurately predict application scalability on multicores and that can also provide the optimal consolidation suggestions to maximize system resource usage while meeting simultaneously application performance targets. The proposed methodology is light-weight and relies on capturing application resource demands using standard tools, via nonintrusive low-level measurements. We evaluate our approach on an IBM Power7 system using the DaCapo and SPECjvm benchmark suites where each benchmark exhibits different patterns of parallelism. From 900 different consolidations of application instances, our tool accurately predicts the average iteration time of allocated applications with an average error below 10%.
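The queueing-theoretic flavor of such prediction tools can be illustrated with the simplest building block, an M/M/1 mean response-time estimate (this is a generic textbook formula, not the paper's model, which uses a full queueing network):

```python
def mm1_response_time(arrival_rate, service_rate):
    """Mean response time of an M/M/1 queue: T = 1 / (mu - lambda).
    Rates are in requests per second; the system must be stable."""
    if arrival_rate >= service_rate:
        raise ValueError("unstable: arrival rate must be below service rate")
    return 1.0 / (service_rate - arrival_rate)

# At 8 req/s offered to a server that completes 10 req/s,
# the mean response time is 1 / (10 - 8) = 0.5 s.
```

The same principle, response time exploding as utilization approaches 1, is what makes consolidation decisions a balancing act between high system usage and per-application performance targets.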


Workshop on Power Aware Computing and Systems | 2011

Towards realizing a low cost and highly available datacenter power infrastructure

Sriram Govindan; Di Wang; Lydia Y. Chen; Anand Sivasubramaniam; Bhuvan Urgaonkar

Realizing highly available datacenter power infrastructure is an extremely expensive proposition with costs more than doubling as we move from three 9s (Tier-1) to six 9s (Tier-4) of availability. Existing approaches only consider the cost/availability trade-off for a restricted set of power infrastructure configurations, relying mainly on component redundancy. A number of additional knobs such as centralized vs. distributed component placement and power-feed interconnect topology also exist, whose impact has only been studied in limited forms. In this paper, we develop detailed datacenter availability models using Continuous-time Markov Chains and Reliability Block Diagrams to quantify the cost-availability trade-off offered by these power infrastructure knobs.
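Reliability Block Diagrams reduce a power-feed topology to series compositions (all components needed) and parallel compositions (any one suffices) of component availabilities. A minimal sketch of that arithmetic, with an illustrative feed configuration (component availabilities here are made-up numbers, not the paper's data):

```python
def series(*availabilities):
    """All blocks must be up: availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def parallel(*availabilities):
    """Any one block suffices: unavailabilities multiply."""
    down = 1.0
    for a in availabilities:
        down *= 1.0 - a
    return 1.0 - down

# One feed of utility -> UPS -> PDU in series, then two redundant feeds:
one_feed = series(0.999, 0.999, 0.999)
dual_feed = parallel(one_feed, one_feed)
```

The parallel composition shows why redundancy is so effective, and so expensive: a second full feed squares the unavailability, which is exactly the cost/availability trade-off the paper's knobs explore.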


Dependable Systems and Networks | 2014

Failure Analysis of Virtual and Physical Machines: Patterns, Causes and Characteristics

Robert Birke; Ioana Giurgiu; Lydia Y. Chen; Dorothea Wiesmann; Ton Engbersen

In today's commercial data centers, the computation density grows continuously as the number of hardware components and workloads in units of virtual machines increases. The service availability guaranteed by data centers heavily depends on the reliability of the physical and virtual servers. In this study, we conduct an analysis on 10K virtual and physical machines hosted in five commercial data centers over an observation period of one year. Our objective is to establish a sound understanding of the differences and similarities between failures of physical and virtual machines. We first capture their failure patterns, i.e., the failure rates, the distributions of times between failures and of repair times, as well as the time and space dependency of failures. Moreover, we correlate failures with the resource capacity and run-time usage to identify the characteristics of failing servers. Finally, we discuss how virtual machine management actions, i.e., consolidation and on/off frequency, impact virtual machine failures.


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2015

Predicting and Mitigating Jobs Failures in Big Data Clusters

Andrea Rosà; Lydia Y. Chen; Walter Binder

In large-scale data centers, software and hardware failures are frequent, resulting in failures of job executions that may cause significant resource waste and performance deterioration. To proactively minimize the resource inefficiency due to job failures, it is important to identify them in advance using key job attributes. However, so far, prevailing research on datacenter workload characterization has overlooked job failures, including their patterns, root causes, and impact. In this paper, we aim to develop prediction models and mitigation policies for unsuccessful jobs, so as to reduce the resource waste in big data centers. In particular, we base our analysis on Google cluster traces, consisting of a large number of big-data jobs with a high task fan-out. We first identify the time-varying patterns of failed jobs and the contributing system features. Based on our characterization study, we develop an on-line predictive model for job failures by applying various statistical learning techniques, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR). Furthermore, we propose a delay-based mitigation policy which, after a certain grace period, proactively terminates the execution of jobs that are predicted to fail. The particular objective of postponing job terminations is to strike a good tradeoff between resource waste and false prediction of successful jobs. Our evaluation results show that the proposed method is able to significantly reduce the resource waste by 41.9% on average, and keep false terminations of jobs low, i.e., only 1%.
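The two ingredients, a probabilistic failure predictor and a delay-based termination rule, can be sketched as follows (the feature weights, grace period, and threshold are illustrative values, not from the paper):

```python
import math

def failure_probability(features, weights, bias):
    """Logistic-regression score: P(job fails) from per-job attributes
    such as requested resources or task fan-out (features assumed)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def should_terminate(prob, elapsed_s, grace_period_s=300, threshold=0.8):
    """Delay-based mitigation: terminate a likely-failing job only after
    a grace period, trading resource waste against false terminations of
    jobs that would in fact have succeeded."""
    return elapsed_s >= grace_period_s and prob >= threshold
```

The grace period is the key design choice: terminating immediately maximizes savings from true positives but also kills mispredicted healthy jobs, while waiting lets early evidence accumulate.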


Dependable Systems and Networks | 2012

Model-driven consolidation of Java workloads on multicores

Danilo Ansaloni; Lydia Y. Chen; Evgenia Smirni; Walter Binder

Optimal resource allocation and application consolidation on modern multicore systems that host multiple applications is not easy. Striking a balance among conflicting targets such as maximizing system throughput and system utilization while minimizing application response times is a quandary for system administrators. The purpose of this work is to offer a methodology that can automate the difficult process of identifying how to best consolidate workloads in a multicore environment. We develop a simple approach that treats the hardware and the operating system as a black box and uses measurements to profile the application resource demands. The demands become input to a queueing network model that successfully predicts application scalability and that captures the performance impact of consolidated applications on shared on-chip and off-chip resources. Extensive analysis with the widely used DaCapo Java benchmarks on an IBM Power 7 system illustrates the model's ability to accurately predict the system's optimal application mix.


Cluster Computing and the Grid | 2012

Dynamic Replication in Service-Oriented Systems

Mathias Bjorkqvist; Lydia Y. Chen; Walter Binder

Service-oriented systems, consisting of atomic services and their compositions hosted in service composition execution engines (CEEs), are commonly deployed to deliver web applications. As the workloads of applications fluctuate over time, it is economical to autonomously and dynamically adjust system capacity, i.e., the number of replicas for atomic services and CEEs. In this paper, we propose a novel replica provisioning policy, Resos, which adjusts the number of CEE and service replicas periodically based on the predicted workloads such that all replicas are well utilized at the target values. In particular, Resos models the workload balance and dependency between CEE and service replicas by estimating the probability that threads of CEE replicas are not blocked by I/O. Moreover, we derive the analytical bounds of CEE effective utilization and illustrate the cause of low nominal utilization at CEE replicas. We evaluate Resos on a simulated service-oriented system, which hosts CEE and service replicas on multi-threaded servers. The evaluated workload is derived from utilization traces collected from production systems. Through simulation, we demonstrate that Resos effectively reduces the number of required replicas while maintaining target utilization and lowering the response times of requests.
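The basic sizing step, choosing the fewest replicas that keep utilization at a target for the predicted workload, reduces to a ceiling division (a simplification of Resos that ignores the CEE/service dependency modeling; names are illustrative):

```python
import math

def replicas_needed(predicted_load, replica_capacity, target_util):
    """Smallest replica count whose combined capacity keeps
    utilization at or below the target.

    predicted_load:   forecast workload for the next period (req/s)
    replica_capacity: throughput of one replica (req/s)
    target_util:      desired utilization per replica, e.g. 0.8
    """
    return max(1, math.ceil(predicted_load / (replica_capacity * target_util)))

# 100 req/s predicted, 25 req/s per replica, 80% target utilization:
replicas_needed(100, 25, 0.8)  # -> 5
```

Resos goes further by estimating how often CEE threads block on I/O, so that CEE and service replica counts are adjusted jointly rather than independently.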

Collaboration


Dive into Lydia Y. Chen's collaborations.

Top Co-Authors

Bhuvan Urgaonkar

Pennsylvania State University


Cheng Wang

Pennsylvania State University
