Peter Garraghan
University of Leeds
Publications
Featured research published by Peter Garraghan.
IEEE International Conference on Cloud Computing Technology and Science | 2014
Ismael Solis Moreno; Peter Garraghan; Paul Townend; Jie Xu
Understanding the characteristics and patterns of workloads within a Cloud computing environment is critical in order to improve resource management and operational conditions while Quality of Service (QoS) guarantees are maintained. Simulation models based on realistic parameters are also urgently needed for investigating the impact of these workload characteristics on new system designs and operation policies. Unfortunately, there is a lack of analyses to support the development of workload models that capture the inherent diversity of users and tasks, largely due to the limited availability of Cloud tracelogs as well as the complexity in analyzing such systems. In this paper we present a comprehensive analysis of the workload characteristics derived from a production Cloud data center that features over 900 users submitting approximately 25 million tasks over a time period of a month. Our analysis focuses on exposing and quantifying the diversity of behavioral patterns for users and tasks, as well as identifying model parameters and their values for the simulation of the workload created by such components. Our derived model is implemented by extending the capabilities of the CloudSim framework and is further validated through empirical comparison and statistical hypothesis tests. We illustrate several examples of this work's practical applicability in the domain of resource management and energy-efficiency.
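A minimal sketch of the general idea of driving a simulator with synthetic task submissions sampled per user from fitted distributions. The distribution choices, rates, and resource ranges below are illustrative placeholders, not the parameter values derived in the paper.

```python
# Sketch: generate a synthetic Cloud workload by sampling per-user task
# inter-arrival times and resource requests. All parameter values are
# illustrative, not the paper's derived model parameters.
import random

def generate_workload(n_users=50, sim_duration=3600.0, seed=42):
    rng = random.Random(seed)
    tasks = []
    for user_id in range(n_users):
        # Each user gets its own submission rate and resource profile,
        # reflecting the per-user diversity the analysis exposes.
        arrival_rate = rng.uniform(0.01, 0.5)   # tasks per second (assumed)
        mean_cpu = rng.uniform(0.05, 0.5)       # normalised CPU request (assumed)
        mean_mem = rng.uniform(0.05, 0.5)       # normalised memory request (assumed)
        t = 0.0
        while True:
            t += rng.expovariate(arrival_rate)  # exponential inter-arrival (assumed)
            if t > sim_duration:
                break
            tasks.append({
                "user": user_id,
                "submit_time": t,
                "cpu": max(0.0, min(1.0, rng.gauss(mean_cpu, 0.1))),
                "mem": max(0.0, min(1.0, rng.gauss(mean_mem, 0.1))),
            })
    return sorted(tasks, key=lambda task: task["submit_time"])

if __name__ == "__main__":
    workload = generate_workload()
    print(f"generated {len(workload)} tasks")
```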
Service Oriented Software Engineering | 2013
Ismael Solis Moreno; Peter Garraghan; Paul Townend; Jie Xu
Analyzing behavioral patterns of workloads is critical to understanding Cloud computing environments. However, until now only a limited number of real-world Cloud data center trace logs have been available for analysis. This has led to a lack of methodologies to capture the diversity of patterns that exist in such datasets. This paper presents the first large-scale analysis of real-world Cloud data, using a recently released dataset that features traces from over 12,000 servers over the period of a month. Based on this analysis, we develop a novel approach for characterizing workloads that for the first time considers Cloud workload in the context of both user and task in order to derive a model to capture resource estimation and utilization patterns. The derived model assists in understanding the relationship between users and tasks within workload, and enables further work such as resource optimization, energy-efficiency improvements, and failure correlation. Additionally, it provides a mechanism to create patterns that randomly fluctuate based on realistic parameters. This is critical to emulating dynamic environments instead of statically replaying records in the trace log. Our approach is evaluated by contrasting the logged data against simulation experiments, and our results show that the derived model parameters correctly describe the operational environment within a 5% error margin, confirming the great variability of patterns that exist in Cloud computing.
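A small sketch of what "randomly fluctuating patterns" might look like in code: a per-task utilisation series generated from a mean and deviation rather than replayed verbatim from the trace. The periodic component and parameter values are assumptions for illustration, not the derived model.

```python
# Sketch: emulate a fluctuating per-task utilisation pattern instead of
# statically replaying trace records. The mean/deviation parameters would
# come from the derived user/task model; values below are placeholders.
import math
import random

def fluctuating_utilisation(mean_util, std_util, n_steps=288, seed=0):
    """Yield one utilisation sample per time step, clipped to [0, 1]."""
    rng = random.Random(seed)
    for step in range(n_steps):
        # Daily-style periodic component plus random fluctuation (assumed shape).
        periodic = 0.1 * math.sin(2 * math.pi * step / n_steps)
        sample = rng.gauss(mean_util + periodic, std_util)
        yield max(0.0, min(1.0, sample))

if __name__ == "__main__":
    series = list(fluctuating_utilisation(mean_util=0.35, std_util=0.05))
    print(f"mean={sum(series)/len(series):.3f}, max={max(series):.3f}")
```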
IEEE Transactions on Emerging Topics in Computing | 2014
Peter Garraghan; Ismael Solis Moreno; Paul Townend; Jie Xu
Cloud computing providers are under great pressure to reduce operational costs through improved energy utilization while provisioning dependable service to customers; it is therefore extremely important to understand and quantify the explicit impact of failures within a system in terms of energy costs. This paper presents the first comprehensive analysis of the impact of failures on energy consumption in a real-world large-scale cloud system (comprising over 12,500 servers), including the study of failure and energy trends of the spatial and temporal environmental characteristics. Our results show that 88% of task failure events occur in lower priority tasks producing 13% of total energy waste, and 1% of failure events occur in higher priority tasks due to server failures producing 8% of total energy waste. These results highlight an unintuitive but significant impact on energy consumption due to failures, providing a strong foundation for research into dependable energy-aware cloud computing.
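A back-of-envelope sketch of how energy waste can be attributed to failed task attempts: each failed attempt is charged the server power it consumed over its runtime. The record fields and the linear power model are assumptions for illustration, not the paper's methodology or figures.

```python
# Sketch: attribute wasted energy to failed task attempts by charging each
# failed attempt the power drawn over its runtime. Fields and the linear
# power model are assumed for illustration.
IDLE_WATTS, PEAK_WATTS = 100.0, 250.0   # assumed server power envelope

def attempt_energy_joules(cpu_util, runtime_s):
    """Linear power model: idle power plus utilisation-proportional dynamic power."""
    watts = IDLE_WATTS + (PEAK_WATTS - IDLE_WATTS) * cpu_util
    return watts * runtime_s

def wasted_energy(attempts):
    """Sum energy of attempts that ended in failure, grouped by task priority."""
    waste_by_priority = {}
    for a in attempts:   # each a: dict with priority, cpu, runtime_s, outcome
        if a["outcome"] == "FAIL":
            waste_by_priority.setdefault(a["priority"], 0.0)
            waste_by_priority[a["priority"]] += attempt_energy_joules(a["cpu"], a["runtime_s"])
    return waste_by_priority

if __name__ == "__main__":
    demo = [
        {"priority": 0, "cpu": 0.2, "runtime_s": 120, "outcome": "FAIL"},
        {"priority": 9, "cpu": 0.6, "runtime_s": 600, "outcome": "FINISH"},
    ]
    print(wasted_energy(demo))
```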
IEEE Internet Computing | 2017
Zhenyu Wen; Renyu Yang; Peter Garraghan; Tao Lin; Jie Xu; Michael Rovatsos
Large-scale Internet of Things (IoT) services such as healthcare, smart cities, and marine monitoring are pervasive in cyber-physical environments strongly supported by Internet technologies and fog computing. Complex IoT services are increasingly composed of sensors, devices, and compute resources within fog computing infrastructures. The orchestration of such applications can be leveraged to alleviate the difficulties of maintenance and enhance data security and system reliability. However, efficiently dealing with dynamic variations and transient operational behavior is a crucial challenge within the context of choreographing complex services. Furthermore, with the rapid increase of the scale of IoT deployments, the heterogeneity, dynamicity, and uncertainty within fog environments and increased computational complexity further aggravate this challenge. This article gives an overview of the core issues, challenges, and future research directions in fog-enabled orchestration for IoT services. Additionally, it presents early experiences of an orchestration scenario, demonstrating the feasibility and initial results of using a distributed genetic algorithm in this context.
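A toy sketch in the spirit of the orchestration scenario mentioned above: a genetic algorithm assigning IoT service components to fog nodes. The fitness function, rates, and problem sizes are illustrative assumptions; the article's distributed genetic algorithm is not reproduced here.

```python
# Toy GA sketch: evolve a placement of service components onto fog nodes,
# favouring balanced load. Objective, rates and sizes are assumed.
import random

def evolve_placement(n_components=20, n_nodes=5, pop_size=30,
                     generations=100, seed=1):
    rng = random.Random(seed)

    def fitness(placement):
        # Prefer balanced load across fog nodes (assumed objective).
        load = [0] * n_nodes
        for node in placement:
            load[node] += 1
        return -(max(load) - min(load))

    population = [[rng.randrange(n_nodes) for _ in range(n_components)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_components)
            child = a[:cut] + b[cut:]              # one-point crossover
            if rng.random() < 0.1:                 # mutation
                child[rng.randrange(n_components)] = rng.randrange(n_nodes)
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

if __name__ == "__main__":
    best = evolve_placement()
    print("best placement:", best)
```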
High Assurance Systems Engineering | 2014
Peter Garraghan; Paul Townend; Jie Xu
Cloud computing research is in great need of statistical parameters derived from the analysis of real-world systems. One aspect of this is the failure characteristics of Cloud environments composed of workloads and servers; currently, few metrics are available that quantify failure and repair times of workloads and servers at a large scale. Workload metrics in particular are critical for characterizing and modeling accurate workload behavior, enabling more realistic workload simulation and failure scenarios of systems. This paper presents the analysis of failure data of a large-scale production Cloud environment (consisting of over 12,500 servers), and includes a study of failure and repair times and characteristics for both Cloud workloads and servers. Our results show that failure characteristics for workloads and servers are highly variable and that production Cloud workloads can be accurately modeled by a Gamma distribution. Repair times range from 30 seconds to 4 days for workloads, and from 25 minutes to 8 days for servers.
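The kind of distribution fitting described above can be sketched in a few lines: fit a Gamma distribution to observed failure-time samples and check the fit with a Kolmogorov-Smirnov test. The input data below is synthetic; the fitted values are not the paper's results.

```python
# Sketch: fit a Gamma distribution to time-between-failure samples and
# check the fit with a KS test. The data here is synthetic stand-in data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
tbf_hours = rng.gamma(shape=2.0, scale=3.0, size=5000)   # stand-in failure data

# Fit with location pinned at zero, as is common for duration data.
shape, loc, scale = stats.gamma.fit(tbf_hours, floc=0)
ks_stat, p_value = stats.kstest(tbf_hours, "gamma", args=(shape, loc, scale))

print(f"fitted shape={shape:.2f}, scale={scale:.2f}")
print(f"KS statistic={ks_stat:.4f}, p-value={p_value:.4f}")
```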
IEEE International Conference on Cloud Engineering | 2013
Peter Garraghan; Paul Townend; Jie Xu
Understanding the resource utilization and server characteristics of large-scale systems is crucial if service providers are to optimize their operations whilst maintaining Quality of Service. For large-scale data centers, identifying the characteristics of resource demand and the current availability of such resources allows system managers to design and deploy mechanisms to improve data center utilization and meet Service Level Agreements with their customers, as well as facilitating business expansion. In this paper, we present a large-scale analysis of server resource utilization and a characterization of a production Cloud data center using the most recent data center trace logs made available by Google. We present their statistical properties, and a comprehensive coarse-grain analysis of the data, including submission rates, server classification, and server resource utilization. Additionally, we perform a fine-grained analysis to quantify the resource utilization of servers wasted due to the early termination of tasks. Our results show that data center resource utilization remains relatively stable at between 40-60%, that the degree of correlation between server utilization and Cloud workload environment varies by server architecture, and that the amount of resource utilization wasted varies between 4.53-14.22% for different server architectures. This provides invaluable real-world empirical data for Cloud researchers in many subject areas.
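A small sketch of the fine-grained accounting described above: aggregating per-server utilisation and the share wasted on tasks that terminated early, from a flat table of task-usage records. The column names are assumptions for illustration, not the actual Google trace schema.

```python
# Sketch: per-server utilisation and the share wasted on early-terminated
# tasks, from a flat task-usage table. Column names are assumed.
import pandas as pd

usage = pd.DataFrame({
    "server_id":   [1, 1, 2, 2, 2],
    "cpu_hours":   [3.0, 1.0, 4.0, 0.5, 2.5],
    "final_state": ["FINISH", "KILL", "FINISH", "FAIL", "FINISH"],
})

per_server = usage.groupby("server_id").agg(
    total_cpu_hours=("cpu_hours", "sum"),
    wasted_cpu_hours=("cpu_hours",
                      lambda s: s[usage.loc[s.index, "final_state"] != "FINISH"].sum()),
)
per_server["wasted_pct"] = 100 * per_server["wasted_cpu_hours"] / per_server["total_cpu_hours"]
print(per_server)
```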
Service Oriented Software Engineering | 2014
Hussain Aljahdali; Abdulaziz Albatli; Peter Garraghan; Paul Townend; Lydia Lau; Jie Xu
As Cloud Computing becomes the dominant computational model in information technology, Cloud security is emerging as a major issue in Cloud adoption, and is regarded as one of the most critical concerns for large Cloud customers such as governments and enterprises. This concern is driven mainly by Multi-Tenancy, the resource-sharing arrangement at the core of Cloud Computing, and its associated risks to confidentiality and integrity. As a result, security concerns may hinder the advancement of Cloud Computing in the market. To propose effective security solutions and strategies, professionals therefore need a good understanding of current Cloud implementations and practices, especially public Clouds, in order to recognize attack vectors and attack surfaces. In this paper we propose an attack model based on a threat model designed to exploit the Multi-Tenancy situation alone. We first establish a clear understanding of Multi-Tenancy, its origin, and its benefits, and illustrate a novel way of approaching Multi-Tenancy. Finally, we attempt to detect suspicious behavior that may indicate a possible attack, by identifying the proposed attack model empirically in the Google trace logs, a dataset covering 29 days of operation released by Google. This dataset has previously been used in reliability and power-consumption studies but, to the best of our knowledge, not in any security study.
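Purely as an illustration of one plausible co-residency signal for the Multi-Tenancy threat discussed above (not the paper's attack model), the sketch below flags users whose tasks land on the same servers as a target user's tasks unusually often. The fields and threshold are assumptions.

```python
# Illustrative only: count how often each user's tasks co-reside with a
# target user's tasks, and flag users above an assumed threshold.
from collections import Counter

def co_residency_counts(placements, target_user):
    """placements: iterable of (user, server) scheduling events."""
    target_servers = {server for user, server in placements if user == target_user}
    return Counter(user for user, server in placements
                   if user != target_user and server in target_servers)

if __name__ == "__main__":
    events = [("victim", "s1"), ("victim", "s2"),
              ("probe", "s1"), ("probe", "s2"), ("probe", "s2"),
              ("benign", "s3")]
    counts = co_residency_counts(events, "victim")
    suspicious = {u: c for u, c in counts.items() if c >= 3}   # assumed threshold
    print("co-residency hits:", dict(counts), "flagged:", suspicious)
```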
IEEE Transactions on Services Computing | 2016
Peter Garraghan; David McKee; Xue Ouyang; David Webster; Jie Xu
Simulation is critical when studying real operational behavior of increasingly complex Cyber-Physical Systems, forecasting future behavior, and experimenting with hypothetical scenarios. A critical aspect of simulation is the ability to evaluate large-scale systems within a reasonable time frame while modeling complex interactions between millions of components. However, modern simulations face limitations in provisioning this functionality for CPSs in terms of balancing simulation complexity with performance, resulting in substantial operational costs required for completing simulation execution. Moreover, users are required to have expertise in modeling and configuring simulations to infrastructure, which is time consuming. In this paper we present Simulation EnvironmEnt Distributor (SEED), a novel approach for simulating large-scale CPSs across a loosely-coupled distributed system requiring minimal user configuration. This is achieved through automated simulation partitioning and instantiation while enforcing tight event messaging across the system. SEED operates efficiently within both small and large-scale OTS hardware, agnostic of cluster heterogeneity and OS running, and is capable of simulating the full system and network stack of a CPS. Our approach is validated through experiments conducted in a cluster to simulate CPS operation. Results demonstrate that SEED is capable of simulating CPSs containing 2,000,000 tasks across 2,000 nodes with only a 6.89× slowdown relative to real time, and executes effectively across distributed infrastructure.
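A toy sketch of the two ideas named above, automatic partitioning of simulated entities across workers and a conservative lock-step advance so no worker runs ahead of the shared clock. This is not SEED itself, only the general shape under assumed names.

```python
# Toy sketch: round-robin partitioning of simulated entities plus a
# conservative lock-step advance of the global simulation clock.
def partition(entities, n_workers):
    """Round-robin split of simulated entities across workers."""
    parts = [[] for _ in range(n_workers)]
    for i, entity in enumerate(entities):
        parts[i % n_workers].append(entity)
    return parts

def run_lockstep(parts, n_steps, step_fn):
    """Advance every partition one global time step at a time, so no worker
    processes an event ahead of the shared clock (conservative sync)."""
    for t in range(n_steps):
        for part in parts:   # in a real deployment each part runs on its own node
            step_fn(part, t)

if __name__ == "__main__":
    parts = partition([f"node-{i}" for i in range(10)], n_workers=3)
    trace = []
    run_lockstep(parts, n_steps=3, step_fn=lambda part, t: trace.append((t, len(part))))
    print([len(p) for p in parts], trace)
```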
IEEE Transactions on Services Computing | 2016
Peter Garraghan; Xue Ouyang; Renyu Yang; David McKee; Jie Xu
Increased complexity and scale of virtualized distributed systems has resulted in the manifestation of emergent phenomena substantially affecting overall system performance. This phenomenon is known as the "Long Tail", whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-cause and quantifies its impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-cause within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5 percent of task stragglers impact 50 percent of total jobs for batch processes, and 53 percent of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution pattern modeling and online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11 percent into their execution lifecycle with 95 percent accuracy for short duration jobs.
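A minimal sketch of the detection idea: compare a running task's progress rate against an offline-modelled expected duration for its job, and flag it early if it is projected to exceed a straggler threshold. The threshold, field names, and linear extrapolation are assumptions, not the paper's detectors.

```python
# Sketch: early straggler flagging by projecting a task's duration from its
# observed progress and comparing against an expected duration. Threshold
# and extrapolation are assumed for illustration.
def is_likely_straggler(elapsed_s, progress_frac, expected_duration_s,
                        threshold=1.5, min_progress=0.02):
    """Flag a task whose projected duration exceeds threshold x the expectation."""
    if progress_frac < min_progress:        # too early to judge reliably
        return False
    projected = elapsed_s / progress_frac   # naive linear extrapolation
    return projected > threshold * expected_duration_s

if __name__ == "__main__":
    # A task 10% complete after 400s, against a 1,000s expectation, projects
    # to 4,000s and is flagged well before it finishes.
    print(is_likely_straggler(elapsed_s=400, progress_frac=0.10,
                              expected_duration_s=1000))
```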
International Symposium on Autonomous Decentralized Systems | 2015
Jemishkumar Patel; Vasu Jindal; I-Ling Yen; Farokh B. Bastani; Jie Xu; Peter Garraghan
In cloud computing, good resource management can benefit both cloud users and cloud providers. Workload prediction is a crucial step towards achieving good resource management. While it is possible to estimate the workloads of long-running tasks based on the periodicity in their historical workloads, it is difficult to do so for tasks which do not have such recurring workload patterns. In this paper, we present an innovative clustering-based resource estimation approach which groups tasks that have similar characteristics into the same cluster. The historical workload data for tasks in a cluster are used to estimate the resources needed by new tasks based on the cluster(s) to which they belong. In particular, for a new task T, we measure T's initial workload and predict to which cluster(s) it may belong. Then, the workload information of the cluster(s) is used to estimate the workload of T. The approach is experimentally evaluated using a Google dataset, including resource usage data of over half a million tasks. We develop a workload model based on the dataset which is then used to estimate the workload patterns of several randomly selected tasks from the trace log. The results confirm the effectiveness of this cluster-based method for estimating the resources required by each task.
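A minimal sketch of the clustering idea: group historical tasks by their early workload signature with k-means, then estimate a new task's resource need from the history of its nearest cluster. The feature choice, number of clusters, and synthetic stand-in history are assumptions, not the paper's model.

```python
# Sketch: cluster tasks on their initial workload features, then estimate a
# new task's peak CPU from its cluster's history. Features and k are assumed.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Columns: [initial CPU usage, initial memory usage]; synthetic stand-in history.
history_features = rng.random((500, 2))
history_peak_cpu = history_features[:, 0] * 2 + rng.normal(0, 0.05, 500)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(history_features)

def estimate_peak_cpu(initial_features):
    """Predict a new task's peak CPU as the mean over its cluster's history."""
    cluster = kmeans.predict(np.asarray(initial_features).reshape(1, -1))[0]
    return history_peak_cpu[kmeans.labels_ == cluster].mean()

print(f"estimated peak CPU: {estimate_peak_cpu([0.3, 0.6]):.2f}")
```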