Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Renyu Yang is active.

Publication


Featured research published by Renyu Yang.


International Symposium on Autonomous Decentralized Systems | 2013

Improved energy-efficiency in cloud datacenters with interference-aware virtual machine placement

Ismael Solis Moreno; Renyu Yang; Jie Xu; Tianyu Wo

Virtualization is one of the main technologies used for improving resource efficiency in datacenters; it allows the deployment of co-existing computing environments over the same hardware infrastructure. However, this co-existence, along with management inefficiencies, often creates scenarios of high competition for resources between running workloads, leading to performance degradation. This phenomenon is known as Performance Interference, and it introduces a non-negligible overhead that affects both a datacenter's Quality of Service and its energy-efficiency. This paper introduces a novel approach to workload allocation that improves energy-efficiency in Cloud datacenters by taking into account their workload heterogeneity. We analyze the impact of performance interference on energy-efficiency using workload characteristics identified from a real Cloud environment, and develop a model that intelligently applies various decision-making techniques to select the best workload host according to its internal interference level. Our experimental results show reductions in interference of 27.5% and increases in energy-efficiency of up to 15% in contrast to current mechanisms for workload allocation.
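The abstract above describes selecting the host whose internal interference level grows the least when a new workload is added. A minimal sketch of that idea follows; the workload type names and the pairwise interference table are illustrative assumptions, not the paper's actual model.

```python
# Hypothetical pairwise interference levels between co-located workload
# types (higher = worse contention); values are made up for illustration.
INTERFERENCE = {
    ("cpu-bound", "cpu-bound"): 0.8,
    ("cpu-bound", "io-bound"): 0.2,
    ("io-bound", "io-bound"): 0.6,
}

def pair_level(a, b):
    # The table is symmetric, so look up both orderings.
    return INTERFERENCE.get((a, b), INTERFERENCE.get((b, a), 0.0))

def host_interference(resident_types, new_type):
    """Total interference the new workload would add on this host."""
    return sum(pair_level(t, new_type) for t in resident_types)

def place(hosts, new_type):
    """Pick the host whose internal interference grows the least."""
    return min(hosts, key=lambda h: host_interference(hosts[h], new_type))

hosts = {"h1": ["cpu-bound", "cpu-bound"], "h2": ["io-bound"]}
best = place(hosts, "cpu-bound")  # h2 adds 0.2 interference vs 1.6 on h1
```

In this toy instance the CPU-bound workload avoids the host already running two CPU-bound workloads.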


Very Large Data Bases | 2014

Fuxi: a fault-tolerant resource management and job scheduling system at internet scale

Zhuo Zhang; Chao Li; Yangyu Tao; Renyu Yang; Hong Tang; Jie Xu

Scalability and fault-tolerance are two fundamental challenges for all distributed computing at Internet scale. Despite many recent advances from both academia and industry, these two problems are still far from settled. In this paper, we present Fuxi, a resource management and job scheduling system that is capable of handling the kind of workload at Alibaba where hundreds of terabytes of data are generated and analyzed every day to help optimize the company's business operations and user experiences. We employ several novel techniques to enable Fuxi to perform efficient scheduling of hundreds of thousands of concurrent tasks over large clusters with thousands of nodes: 1) an incremental resource management protocol that supports multi-dimensional resource allocation and data locality; 2) user-transparent failure recovery, where failures of any Fuxi components will not impact the execution of user jobs; and 3) an effective detection mechanism and a multi-level blacklisting scheme that prevent faulty nodes from affecting job execution. Our evaluation results demonstrate that 95% and 91% scheduled CPU/memory utilization can be fulfilled under synthetic workloads, and Fuxi is capable of achieving 2.36 TB/minute throughput in GraySort. Additionally, the same Fuxi job only experiences approximately 16% slowdown under a 5% fault-injection rate. The slowdown only grows to 20% when we double the fault-injection rate to 10%. Fuxi has been deployed in our production environment since 2009, and it now manages hundreds of thousands of server nodes.
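The multi-level blacklisting scheme mentioned above can be pictured as per-node failure counters that escalate a node's blacklist level as failures accumulate. The sketch below is a hedged illustration; the level names and thresholds are assumptions, not Fuxi's actual parameters.

```python
# Illustrative multi-level blacklist: a node that keeps failing is first
# avoided for the affected job, then cluster-wide. Thresholds are made up.
from collections import defaultdict

class Blacklist:
    JOB_LEVEL = 3       # failures before the node is avoided for a job
    CLUSTER_LEVEL = 10  # failures before the node is avoided cluster-wide

    def __init__(self):
        self.failures = defaultdict(int)

    def record_failure(self, node):
        self.failures[node] += 1

    def level(self, node):
        n = self.failures[node]
        if n >= self.CLUSTER_LEVEL:
            return "cluster"
        if n >= self.JOB_LEVEL:
            return "job"
        return "ok"

bl = Blacklist()
for _ in range(4):
    bl.record_failure("node-17")
# node-17 has 4 recorded failures: job-level blacklisted, not cluster-wide
```

Escalating rather than immediately banning a node tolerates transient faults while still isolating persistently bad hardware.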


IEEE Internet Computing | 2017

Fog Orchestration for Internet of Things Services

Zhenyu Wen; Renyu Yang; Peter Garraghan; Tao Lin; Jie Xu; Michael Rovatsos

Large-scale Internet of Things (IoT) services such as healthcare, smart cities, and marine monitoring are pervasive in cyber-physical environments strongly supported by Internet technologies and fog computing. Complex IoT services are increasingly composed of sensors, devices, and compute resources within fog computing infrastructures. The orchestration of such applications can be leveraged to alleviate the difficulties of maintenance and enhance data security and system reliability. However, efficiently dealing with dynamic variations and transient operational behavior is a crucial challenge within the context of choreographing complex services. Furthermore, with the rapid increase of the scale of IoT deployments, the heterogeneity, dynamicity, and uncertainty within fog environments and increased computational complexity further aggravate this challenge. This article gives an overview of the core issues, challenges, and future research directions in fog-enabled orchestration for IoT services. Additionally, it presents early experiences of an orchestration scenario, demonstrating the feasibility and initial results of using a distributed genetic algorithm in this context.


IEEE Transactions on Services Computing | 2016

Straggler Root-Cause and Impact Analysis for Massive-scale Virtualized Cloud Datacenters

Peter Garraghan; Xue Ouyang; Renyu Yang; David McKee; Jie Xu

The increased complexity and scale of virtualized distributed systems have resulted in the manifestation of emergent phenomena substantially affecting overall system performance. One such phenomenon is known as the "Long Tail", whereby a small proportion of task stragglers significantly impede job completion time. While existing work focuses on straggler detection and mitigation, there is limited work that empirically studies straggler root-causes and quantifies their impact upon system operation. Such analysis is critical to ascertain in-depth knowledge of straggler occurrence for focusing developmental and research efforts towards solving the Long Tail challenge. This paper provides an empirical analysis of straggler root-causes within virtualized Cloud datacenters; we analyze two large-scale production systems to quantify the frequency and impact stragglers impose, and propose a method for conducting root-cause analysis. Results demonstrate that approximately 5 percent of task stragglers impact 50 percent of total jobs for batch processes, and 53 percent of stragglers occur due to high server resource utilization. We leverage these findings to propose a method for extreme straggler detection through a combination of offline execution-pattern modeling and online analytic agents that monitor tasks at runtime. Experiments show the approach is capable of detecting stragglers less than 11 percent into their execution lifecycle with 95 percent accuracy for short-duration jobs.
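A common baseline for the kind of online straggler flagging discussed above is to compare each task's elapsed runtime against the median of its sibling tasks. The sketch below illustrates that baseline only; the 1.5x threshold is an assumption for illustration, not the paper's detection method.

```python
# Hedged sketch: flag a task as a straggler when its runtime exceeds a
# multiple of the median runtime of sibling tasks in the same job.
from statistics import median

def flag_stragglers(runtimes, threshold=1.5):
    """runtimes: {task_id: elapsed_seconds}. Returns flagged task ids."""
    med = median(runtimes.values())
    return [t for t, r in runtimes.items() if r > threshold * med]

# Four sibling tasks; t4 is far past the median and gets flagged.
runtimes = {"t1": 10.0, "t2": 11.0, "t3": 9.5, "t4": 31.0}
```

The median (rather than the mean) keeps the baseline itself robust to the stragglers being detected.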


Service Oriented Software Engineering | 2014

VMCSnap: Taking Snapshots of Virtual Machine Cluster with Memory Deduplication

Yumei Huang; Renyu Yang; Lei Cui; Tianyu Wo; Chunming Hu; Bo Li

Virtualization is one of the main technologies currently used to deploy computing systems due to the high reliability and rapid crash recovery it offers in comparison to physical nodes. These features are mainly achieved by continuously producing snapshots of the status of running virtual machines. In earlier works, the snapshot of each individual VM is performed independently, ignoring the memory similarities between VMs within the cluster. When the size of the virtual cluster becomes larger or snapshots are taken frequently, the snapshots can become extremely large, consuming a large amount of storage space. In this paper, we introduce an innovative snapshot approach for virtual clusters that exploits shared memory pages among all the component VMs to reduce the size of the produced snapshots and mitigate the I/O bottleneck. The duplicate memory pages are effectively discovered and stored only once when the snapshot is taken. In addition, our approach can be applied not only to stop-copy snapshots but also to the pre-copy mechanism. Experiments on both snapshot methods are conducted, and the results show that our method can reduce the total memory snapshot size by an average of 30% and achieve a 63% reduction in snapshot time compared with the default KVM approach, with little rollback-time overhead.
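The core idea of storing duplicate pages only once can be sketched as content-hashing each memory page and keeping a shared page store plus a per-VM manifest of hashes. This is a minimal illustration under assumed names; real systems hash fixed-size (e.g. 4 KiB) pages and verify byte-for-byte on hash collisions.

```python
# Illustrative snapshot-time memory deduplication across a VM cluster:
# identical pages are stored once in a shared store, keyed by hash.
import hashlib

def snapshot(vms):
    """vms: {vm_name: [page_bytes, ...]}.
    Returns (page_store, per-VM manifests of page hashes)."""
    store = {}      # hash -> page content, stored only once
    manifests = {}  # vm -> ordered list of page hashes
    for name, pages in vms.items():
        refs = []
        for page in pages:
            h = hashlib.sha256(page).hexdigest()
            store.setdefault(h, page)  # duplicate pages are not re-stored
            refs.append(h)
        manifests[name] = refs
    return store, manifests

# Two VMs sharing one identical page (b"libc"): 4 pages, 3 stored.
vms = {"vm1": [b"kernel", b"libc"], "vm2": [b"libc", b"app"]}
store, manifests = snapshot(vms)
```

Restoring a VM then means walking its manifest and fetching each page from the shared store.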


Service Oriented Software Engineering | 2016

Computing at Massive Scale: Scalability and Dependability Challenges

Renyu Yang; Jie Xu

Large-scale Cloud systems and big data analytics frameworks are now widely used for practical services and applications. However, with the increase of data volume, together with the heterogeneity of workloads and resources, and the dynamic nature of massive user requests, the uncertainties and complexity of resource management and service provisioning increase dramatically, often resulting in poor resource utilization, vulnerable system dependability, and user-perceived performance degradations. In this paper we report our latest understanding of the current and future challenges in this particular area, and discuss both existing and potential solutions to the problems, especially those concerned with system efficiency, scalability and dependability. We first introduce a data-driven analysis methodology for characterizing the resource and workload patterns and tracing performance bottlenecks in a massive-scale distributed computing environment. We then examine and analyze several fundamental challenges and the solutions we are developing to tackle them, including for example incremental but decentralized resource scheduling, incremental messaging communication, rapid system failover, and request handling parallelism. We integrate these solutions with our data analysis methodology in order to establish an engineering approach that facilitates the optimization, tuning and verification of massive-scale distributed systems. We aim to develop and offer innovative methods and mechanisms for future computing platforms that will provide strong support for new big data and IoE (Internet of Everything) applications.


IEEE International Conference on Cloud Computing Technology and Science | 2013

An Analysis of Performance Interference Effects on Energy-Efficiency of Virtualized Cloud Environments

Renyu Yang; Ismael Solis Moreno; Jie Xu; Tianyu Wo

Co-allocated workloads in a virtualized computing environment often have to compete for resources, thereby suffering from performance interference. While this phenomenon has a direct impact on the Quality of Service provided to customers, it also changes the patterns of resource utilization and reduces the amount of work per Watt consumed. Unfortunately, there has been only limited research into how performance interference affects the energy-efficiency of servers in such environments. In reality, there is a highly dynamic and complicated correlation among resource utilization, performance interference and energy-efficiency. This paper presents a comprehensive analysis that quantifies the negative impact of performance interference on the energy-efficiency of virtualized servers. Our analysis methodology takes into account the heterogeneous workload characteristics identified from a real Cloud environment. In particular, we investigate the impact due to different workload type combinations and develop a method for approximating the levels of performance interference and energy-efficiency degradation. The proposed method is based on profiles of pair combinations of existing workload types and the patterns derived from the analysis. Our experimental results reveal a non-linear relationship between the increase in interference and the reduction in energy-efficiency, as well as an average precision within a +/-5% error margin for the estimation of both parameters. These findings provide vital information for research into dynamic trade-offs between the resource utilization, performance, and energy-efficiency of a data center.


IEEE Transactions on Services Computing | 2017

Reliable Computing Service in Massive-Scale Systems through Rapid Low-Cost Failover

Renyu Yang; Yang Zhang; Peter Garraghan; Yihui Feng; Jin Ouyang; Jie Xu; Zhuo Zhang; Chao Li

Large-scale distributed systems deployed as Cloud datacenters are capable of provisioning service to consumers with diverse business requirements. Providers face pressure to provision uninterrupted reliable services while reducing operational costs due to significant software and hardware failures. A widely adopted means to achieve such a goal is using redundant system components to implement user-transparent failover, yet its effectiveness must be balanced carefully against the overhead it incurs when deployed, an important practical consideration for complex large-scale systems. Failover techniques developed for Cloud systems often suffer serious limitations, including mandatory restart leading to poor cost-effectiveness, as well as solely focusing on crash failures, omitting other important types such as timing failures and simultaneous failures. This paper addresses these limitations by presenting a new approach to user-transparent failover for massive-scale systems. The approach uses soft-state inference to achieve rapid failure recovery and avoid unnecessary restart, with minimal system resource overhead. It also copes with different failures, including correlated and simultaneous events. The proposed approach was implemented, deployed and evaluated within the Fuxi system, the underlying resource management system used within Alibaba Cloud. Results demonstrate that our approach tolerates complex failure scenarios while incurring at worst a 228.5-microsecond instance overhead with 1.71 percent additional CPU usage.


High Performance Computing and Communications | 2013

CloudAP: Improving the QoS of Mobile Applications with Efficient VM Migration

Yunkai Zhang; Renyu Yang; Tianyu Wo; Chunming Hu; Junbin Kang; Lei Cui

Mobile computing is growing rapidly in terms of computational demand as well as the number and use of mobile devices. Remote execution techniques enrich the service experience of mobile devices by leveraging the computation and storage resource pools of the cloud data center. However, user experience and quality of service are severely affected by the inherent high latency and low bandwidth of a WAN environment. In this paper, we introduce a cloud base station, CloudAP, which is a small-scale computing infrastructure close to mobile users with local network access. In addition, we present a two-tier architecture consisting of CloudAP and the Cloud center and show how to synthesize them to form a general computing environment. Furthermore, we propose a prompt execution environment migration scheme implemented by an efficient whole-system VM migration, which allows the execution environment to follow the location of the mobile device. Our experimental results demonstrate that the proposed architecture is effective and the execution environment migration scheme is efficient, with up to 10 ms of service downtime and 30 s of execution environment switch time, respectively. These improvements make vital contributions to user experience and QoS in mobile pervasive environments.


Proceedings of the Second International Conference on Internet of Vehicles (IOV 2015) - Safe and Intelligent Mobility - Volume 9502 | 2015

A Method for Private Car Transportation Dispatching Based on a Passenger Demand Model

Wenbo Jiang; Tianyu Wo; Mingming Zhang; Renyu Yang; Jie Xu

Although the demand for taxis is increasing rapidly with the soaring population in big cities, the number of taxis has grown relatively slowly in recent years. In this context, private transportation such as Uber is emerging as a flexible business model, supplementary to the regular form of taxis. At present, much work mainly focuses on the reduction or minimization of taxi cruising miles. However, these taxi-based approaches have some limitations in the case of private car transportation because they do not fully utilize the order information available from the new type of business model. In this paper we present a dispatching method that further reduces the cruising mileage of private car transportation, based on a passenger demand model. In particular, we partition an urban area into many separate regions by using a spatial clustering algorithm and divide a day into several time slots according to the statistics of historical orders. Locally Weighted Linear Regression is adopted to depict the passenger demand model for a given region over a time slot. Finally, the dispatching process is formalized as a weighted bipartite graph matching problem and we then leverage our dispatching approach to schedule private vehicles. We assess our approach through several experiments using real datasets derived from a private car hiring company in China. The experimental results show that up to 74% accuracy could be achieved on passenger demand inference. Additionally, the conducted simulation tests demonstrate a 22.5% reduction in cruising mileage.
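The weighted bipartite matching formulation above pairs idle cars with predicted demand regions so that total cruising distance is minimized. The sketch below illustrates the formulation on a toy instance; the distances are invented, and the brute-force solver stands in for a proper algorithm (production systems would use e.g. the Hungarian method).

```python
# Illustrative weighted bipartite matching: assign each idle car to one
# demand region, minimizing total cruising distance. Brute force is fine
# only for tiny instances like this one.
from itertools import permutations

def dispatch(cost):
    """cost[i][j]: cruising distance of car i to region j.
    Returns (assignment car->region, total distance) minimizing the total."""
    n = len(cost)
    best, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best, best_cost = perm, total
    return list(best), best_cost

# 3 idle cars x 3 demand regions, hypothetical distances in km
cost = [[4, 1, 3],
        [2, 0, 5],
        [3, 2, 2]]
assignment, total = dispatch(cost)  # car0->region1, car1->region0, car2->region2
```

Note the optimal matching is not greedy per car: car 1 gives up its zero-cost region so the total over all cars is smaller.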

Collaboration


Dive into Renyu Yang's collaborations.

Top Co-Authors

Jie Xu

University of Leeds

Bo Li

Tsinghua University