Publication


Featured research published by Hailong Yang.


Grid Computing | 2012

MapReduce Workload Modeling with Statistical Approach

Hailong Yang; Zhongzhi Luan; Wenjun Li; Depei Qian

Large-scale data-intensive cloud computing with the MapReduce framework is becoming pervasive for the core business of many academic, government, and industrial organizations. Hadoop, a state-of-the-art open source project, is by far the most successful realization of the MapReduce framework. While MapReduce is easy-to-use, efficient, and reliable for data-intensive computations, the excessive number of configuration parameters in Hadoop makes it unexpectedly challenging to run various workloads effectively on a Hadoop cluster. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that performs poorly, either because they have no idea how these configurations influence performance, or because they are not even aware that these configurations exist. There is a pressing need for comprehensive analysis and performance modeling to ease MapReduce application development and guide performance optimization under different Hadoop configurations. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. We apply principal component analysis and cluster analysis to 45 different metrics to derive relationships between workload characteristics and the corresponding performance under different Hadoop configurations. We also construct regression models that predict the performance of various workloads under different Hadoop configurations. Our analysis reveals several non-intuitive relationships between workload characteristics and performance, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.
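The statistical pipeline described in this abstract — principal component analysis to reduce the metric space, then regression from the reduced features to performance — can be sketched as follows. Everything here is illustrative: the 8 synthetic metrics, the weights, and the 95% variance threshold stand in for the paper's 45 metrics and its actual fitting choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the paper's data: rows are workload runs, columns
# are collected metrics (8 metrics here; the paper uses 45).
X = rng.normal(size=(60, 8))
true_w = np.array([2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.1, size=60)     # e.g. job runtime

# Principal component analysis via SVD on the centered metric matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S**2 / np.sum(S**2)
k = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1  # keep ~95% variance
Z = Xc @ Vt[:k].T                                         # reduced features

# Least-squares regression from principal components to performance.
A = np.c_[Z, np.ones(len(Z))]
w, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ w
print(round(float(np.corrcoef(pred, y)[0, 1]), 3))  # correlation of fit
```

The same shape applies whatever the metrics are: center, project onto the leading components, then regress performance on the projected features.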


Future Generation Computer Systems | 2014

iMeter: An integrated VM power model based on performance profiling

Hailong Yang; Qi Zhao; Zhongzhi Luan; Depei Qian

The unprecedented burst in power consumption encountered by contemporary datacenters continually drives the development of energy-efficient techniques, from both hardware and software perspectives, to alleviate the energy problem. The most widely adopted power-saving solutions in datacenters that deliver cloud computing services are power capping and VM consolidation. However, without the capability to track VM power usage precisely, the combined effect of these two techniques can cause severe performance degradation to the consolidated VMs, thus violating user service-level agreements. In this paper, we propose an integrated VM power model called iMeter, which overcomes the drawbacks of over-presumption and over-approximation in the segregated power models used in previous studies. We leverage kernel-based performance counters, which provide accurate performance statistics as well as high portability across heterogeneous platforms, to build the VM power model. Principal component analysis is applied to identify performance counters that show a strong impact on VM power consumption with mathematical confidence. We also present a brief interpretation of the first four selected principal components and what they indicate about VM power consumption. Using clustering analysis, we demonstrate that our approach is independent of the underlying hardware and virtualization configurations. We use support vector regression to build the VM power model, predicting the power consumption of both a single VM and multiple consolidated VMs running various workloads. The experimental results show that our model predicts instantaneous VM power usage with an average error of 5% and 4.7%, respectively, against actual power measurements.
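The core idea — mapping performance-counter readings to watts with a kernel regression — can be sketched as below. The three counters and the synthetic power curve are made-up assumptions, and closed-form kernel ridge regression stands in for the paper's support vector regression (both are kernel methods; SVR adds an epsilon-insensitive loss and sparsity that this sketch skips).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-VM performance-counter rates (3 counters, normalized to
# [0, 1]) and a synthetic nonlinear power curve; neither is iMeter's data.
X = rng.uniform(size=(80, 3))
watts = 40 + 25 * X[:, 0] + 10 * np.sin(3 * X[:, 1])

def rbf(A, B, gamma=5.0):
    # Gaussian (RBF) kernel matrix between two sets of samples.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

# Closed-form kernel ridge regression as a stand-in for SVR.
K = rbf(X, X)
alpha = np.linalg.solve(K + 1e-3 * np.eye(len(X)), watts)

# Predict power for unseen counter readings.
X_new = rng.uniform(size=(20, 3))
truth = 40 + 25 * X_new[:, 0] + 10 * np.sin(3 * X_new[:, 1])
pred = rbf(X_new, X) @ alpha
mean_rel_err = float(np.mean(np.abs(pred - truth) / truth))
print(round(mean_rel_err * 100, 2))  # mean relative error, percent
```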


Operating Systems Review | 2016

Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers

Quan Chen; Hailong Yang; Jason Mars; Lingjia Tang

Modern warehouse-scale computers (WSCs) are being outfitted with accelerators to provide the significant compute required by emerging intelligent personal assistant (IPA) workloads such as voice recognition, image classification, and natural language processing. It is well known that the diurnal access pattern of user-facing services provides a strong incentive to co-locate applications for better accelerator utilization and efficiency, and prior work has focused on enabling co-location on multicore processors. However, interference when co-locating applications on non-preemptive accelerators is fundamentally different from contention on multicore CPUs and introduces a new set of challenges for reducing QoS violations. To address this open problem, we first identify the underlying causes of QoS violations in accelerator-outfitted servers. Our experiments show that queuing delay for compute resources and PCIe bandwidth contention for data transfer are the two main factors that contribute to the long tails of user-facing applications. We then present Baymax, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCIe bandwidth contention to deliver the required QoS for user-facing applications and increase accelerator utilization. Using DjiNN, a deep neural network service, Sirius, an end-to-end IPA workload, and traditional applications on an NVIDIA K40 GPU, our evaluation shows that Baymax improves accelerator utilization by 91.3% while achieving the desired 99th-percentile latency target for user-facing applications. In fact, Baymax reduces the 99th-percentile latency of user-facing applications by up to 195x over default execution.
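Why queuing delay dominates on a non-preemptive device, and how duration-aware admission control helps, can be shown with a minimal sketch. The durations and headroom value are made-up numbers, and this is only the structural idea: Baymax itself predicts task durations at runtime rather than being handed them.

```python
# Because the accelerator is non-preemptive, a latency-critical (LC) kernel
# arriving later must wait behind everything already queued. A best-effort
# kernel is therefore admitted only if the LC task's headroom still covers
# the total queued work.
def can_admit(queued_ms, candidate_ms, lc_headroom_ms):
    return queued_ms + candidate_ms <= lc_headroom_ms

queue = []            # predicted durations (ms) of admitted best-effort kernels
headroom = 10.0       # time the LC task can afford to wait
for duration in [3.0, 4.0, 5.0, 1.0]:
    if can_admit(sum(queue), duration, headroom):
        queue.append(duration)

print(queue)  # → [3.0, 4.0, 1.0]
```

The 5.0 ms kernel is rejected because admitting it would push the LC task's wait past its headroom, while the later 1.0 ms kernel still fits.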


International Parallel and Distributed Processing Symposium | 2012

Statistics-based Workload Modeling for MapReduce

Hailong Yang; Zhongzhi Luan; Wenjun Li; Depei Qian; Gang Guan

Large-scale data-intensive computing with the MapReduce framework in the cloud is becoming pervasive for the core business of many academic, government, and industrial organizations. Hadoop is by far the most successful realization of the MapReduce framework. While MapReduce is easy-to-use, efficient, and reliable for data-intensive computations, the excessive number of configuration parameters in Hadoop makes it unexpectedly challenging to run various workloads effectively on a Hadoop cluster. Consequently, developers with little experience of the Hadoop configuration system may devote significant effort to writing an application that performs poorly, either because they have no idea how these configurations influence performance, or because they are not even aware that these configurations exist. In this paper, we propose a statistical analysis approach to identify the relationships among workload characteristics, Hadoop configurations, and workload performance. Our analysis reveals several non-intuitive relationships between workload characteristics and relative performance, and the experimental results demonstrate that our regression models accurately predict the performance of MapReduce workloads under different Hadoop configurations.


Advances in Experimental Medicine and Biology | 2010

GPU Acceleration of Dock6’s Amber Scoring Computation

Hailong Yang; Qiongqiong Zhou; Bo Li; Yongjian Wang; Zhongzhi Luan; Depei Qian; Hanlu Li

Addressing the problem of virtual screening is a long-term goal in the drug discovery field which, if properly solved, can significantly shorten the R&D cycle of new drugs. The scoring functionality that evaluates the fitness of a docking result is one of the major challenges in virtual screening. In general, scoring in docking requires a large amount of floating-point calculation, which usually takes several weeks or even months to finish. This time-consuming procedure is unacceptable, especially when highly fatal and infectious viruses such as SARS and H1N1 arise, which forces the scoring task to be done in a limited time. This paper presents how to leverage the computational power of the GPU to accelerate the Amber (J. Comput. Chem. 25: 1157–1174, 2004) scoring of Dock6 (http://dock.compbio.ucsf.edu/DOCK_6/) with the NVIDIA CUDA (Compute Unified Device Architecture) platform (NVIDIA Corporation Technical Staff, Compute Unified Device Architecture – Programming Guide, NVIDIA Corporation, 2008). We also discuss many factors that greatly influence performance after porting the Amber scoring to the GPU, including thread management, data transfer, and divergence hiding. Our experiments show that the GPU-accelerated Amber scoring achieves a 6.5x speedup over the original version running on an AMD dual-core CPU for the same problem size. This acceleration makes Amber scoring more competitive and efficient for large-scale virtual screening problems.


High Performance Computing and Communications | 2015

Request Squeezer: Mitigating Tail Latency through Pruned Request Replication

Zuowei Zhang; Hailong Yang; Zhongzhi Luan; Depei Qian

Modern Internet services compute over large-scale datasets and respond to user requests instantly. To deliver a satisfactory user experience, the tail latency of these services must be managed within the service-level agreement (SLA). Existing techniques for mitigating tail latency launch multiple replicas of each request on different machines and use the result of the one that finishes first. However, depending on the system utilization, a portion of the replicas violate the SLA even before running and thus waste resources unnecessarily when executed. These unnecessary replicas further delay subsequent replicas, dragging the tail latency beyond the SLA target, especially under high system utilization. We present Request Squeezer, a methodology for mitigating tail latency through pruned request replication. For each replica, Request Squeezer leverages the queuing and service time to predict its latency, and terminates any replica predicted to miss the SLA target. In case all replicas of a request would be pruned, certain replicas are marked as survivors, which are immune to pruning. Evaluation with a Google Web Search workload shows that our approach saves 11.6% of resources and improves the maximum throughput by 25.9% while meeting the same SLA, compared with state-of-the-art request replication techniques.
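The pruning rule in this abstract — predict each replica's latency as queuing delay plus service time, drop predicted SLA misses, but always keep a survivor — can be sketched as below. The tuple format and the numbers are illustrative, not the paper's implementation.

```python
# Each replica is a (queued_ms, service_ms) pair for one candidate machine;
# its predicted latency is the sum of the two.
def prune_replicas(replicas, sla_ms):
    kept = [r for r in replicas if r[0] + r[1] <= sla_ms]
    if not kept:
        # All replicas would be pruned: mark the fastest predicted one as
        # a survivor, immune to pruning, so the request still completes.
        kept = [min(replicas, key=lambda r: r[0] + r[1])]
    return kept

print(prune_replicas([(5, 8), (20, 8), (2, 8)], sla_ms=15))  # → [(5, 8), (2, 8)]
print(prune_replicas([(20, 8), (30, 8)], sla_ms=15))         # → [(20, 8)]
```

The second call shows the survivor rule: even when every replica is predicted to miss the SLA, the least-bad one is still executed.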


Network-Based Information Systems | 2012

Efficient Statistical Computing on Multicore and MultiGPU Systems

Yulong Ou; Bo Li; Hailong Yang; Zhongzhi Luan; Depei Qian

As a statistical programming language for data analysis with a powerful graphics toolkit, R has been widely used in mathematical computing, biology simulation, and medical research. For large-scale computing tasks such as drug discovery and protein folding, R is not sufficient, since it usually runs on a desktop computer. The situation gets worse when R runs on a single machine while the rest of the computation is done on a cluster or even a supercomputer. In this paper, we propose a parallel computing scheme that runs R on both CPU and GPU clusters, delivering high multi-threaded performance and high parallelism with lower energy consumption. We rewrote three statistical algorithms: the chi-squared distribution, the Pearson correlation coefficient, and the unary linear regression model. Evaluation shows that our implementation exhibits superior performance and energy efficiency compared with the single-threaded versions. For instance, when the input dataset reaches 400M, the MPI implementation of the chi-squared distribution on a cluster with four nodes achieves a speedup of nearly 20x, while the CUDA implementation achieves a speedup of 5.2x on a single GPU, and more than 15x on a system with three GPUs.
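Why a statistic like chi-squared parallelizes so well can be shown in a few lines: it is a sum of independent per-element terms, so each chunk can be handed to an MPI rank or a GPU and the partial sums reduced at the end. The synthetic data and the chunk count below are illustrative only; the paper's MPI/CUDA versions follow the same map-reduce shape.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic observed counts against a uniform expectation.
obs = rng.integers(1, 50, size=100_000).astype(float)
exp = np.full_like(obs, obs.mean())

# Map: compute per-element terms and per-chunk partial sums
# (each chunk could run on a separate core, node, or GPU).
terms = (obs - exp) ** 2 / exp
partials = [chunk.sum() for chunk in np.array_split(terms, 8)]

# Reduce: combine the partial sums into the final statistic.
chi2 = float(sum(partials))
print(round(chi2, 1))
```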


International Conference on Parallel Processing | 2012

UVMPM: A Unitary Approach for VM Power Metering Based on Performance Profiling

Hailong Yang; Qi Zhao; Zhongzhi Luan; Depei Qian

Energy consumption in contemporary data centers is continuously surging in an unsustainable way, and tremendous efforts have been devoted, by both industry and academia, to addressing this issue. The pervasive use of virtual machines (VMs) in cloud computing offers service providers an opportunity to consolidate various workloads onto fewer servers, thus improving resource utilization and eliminating idle power consumption. Meanwhile, over-provisioning VMs on individual servers may cause more power to be consumed by the cooling systems and imbalance in the power distribution systems, which increases the possibility of system failures due to overheating. Therefore, power capping is widely adopted to enforce an upper bound on the power consumption of a single server, taking advantage of dynamic voltage and frequency scaling (DVFS) technology, which is extensively supported by server manufacturers.


Conference on Decision and Control | 2011

CDebugger: A scalable parallel debugger with dynamic communication topology configuration

Juncheng Zhang; Zhongzhi Luan; Wenjun Li; Hailong Yang; Junjie Ni; Yuanqiang Huang; Depei Qian

As cloud computing takes off, its application range has expanded significantly. No longer limited to SOA applications, some high-performance parallel applications now reside on cloud platforms. Users can take advantage of large volumes of resources and run compute- and data-intensive parallel applications over clouds according to their applications' demands, without deploying physical infrastructure themselves. In this context, tracing and debugging parallel applications becomes a pressing requirement in clouds. At present, there are still no widely adopted software tools that enable a robust parallel debugging environment for cloud computing. Considering that the scale of parallel applications can be very large and a massive number of compute nodes may be needed, in this paper we design a new parallel debugger named CDebugger with a scalable communication and startup mechanism based on a tree-structured network. We also propose a dynamic communication topology configuration algorithm to maximize debugging performance and reduce response time. The evaluation results show that our parallel debugger scales well across various application sizes.
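A back-of-the-envelope sketch shows why a tree-based startup mechanism scales: with a fan-out of k, every launch round multiplies the number of running daemons by k, so reaching n processes takes about log_k(n) rounds instead of n sequential launches. The fan-out value below is illustrative, not CDebugger's actual configuration.

```python
import math

def startup_rounds(n_procs, fanout):
    # Each round, every already-started daemon starts `fanout` more,
    # so coverage grows as fanout**depth.
    if n_procs <= 1:
        return 0
    return math.ceil(math.log(n_procs, fanout))

print(startup_rounds(1000, 8))  # → 4
```

This is also the intuition behind the dynamic topology configuration: the fan-out (and hence tree depth) trades per-node load against the number of launch and message-forwarding rounds.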


International Conference on Algorithms and Architectures for Parallel Processing | 2013

POIGEM: A Programming-Oriented Instruction Level GPU Energy Model for CUDA Program

Qi Zhao; Hailong Yang; Zhongzhi Luan; Depei Qian

GPU architectures are becoming increasingly important in the multi-core era due to their formidable computational horsepower. With the assistance of effective programming paradigms such as CUDA, GPUs are widely adopted to accelerate scientific applications. Meanwhile, the surging energy consumption of GPUs has become a major challenge for both GPU architects and programmers. In addition to efforts on designing energy-efficient GPU architectures, a comprehensive understanding of how programming affects the energy consumption of a GPU application is also indispensable from the programmer's perspective. In this paper, we present a programming-oriented, PTX instruction-level energy model that gives programmers the ability to predict the energy consumption of their programs. Distinct from previous models, which require hardware performance counters or architectural simulation, our model relies on the PTX instructions of a CUDA program, which makes it not only portable but also accurate. With PTX instructions selected based on an empirical study, we apply linear regression to build the GPU energy model. One appealing advantage of our model is that it does not require any instrumentation or profiling of the GPU application during execution. In fact, our model can advise programmers step by step on how their way of programming impacts the final energy consumption, especially while they are writing the code. Our model is evaluated on an NVIDIA GeForce GTX 470 with the Rodinia benchmark suite. The results show that the accuracy of our model is promising, with an average prediction error below 3.7%. With the help of our GPU energy model, programmers gain valuable insights to improve the energy efficiency of their applications.
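The linear-regression step can be sketched as follows: if total energy is roughly the sum of per-class instruction counts times a per-instruction energy cost, the costs can be recovered by least squares from (counts, measured energy) pairs. The three instruction classes, the nJ costs, and the noise level below are made up for illustration; they are not POIGEM's fitted coefficients.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-program PTX instruction counts (30 programs, 3 classes)
# and synthetic "measured" energy = counts · cost + measurement noise.
counts = rng.integers(1_000, 100_000, size=(30, 3)).astype(float)
cost_nj = np.array([0.5, 2.0, 0.8])  # assumed nJ per instruction class
energy = counts @ cost_nj + rng.normal(scale=50.0, size=30)

# Recover the per-instruction costs by linear regression, as an
# instruction-level energy model would.
fit, *_ = np.linalg.lstsq(counts, energy, rcond=None)
print(np.round(fit, 2))
```

Once fitted, the model predicts a new program's energy from a static count of its PTX instructions, with no runtime profiling.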

Collaboration


Dive into Hailong Yang's collaboration.

Top Co-Authors
Bo Li

Tsinghua University


Jia Zhai

Communication University of China
