Jidong Zhai
Tsinghua University
Publications
Featured research published by Jidong Zhai.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2010
Jidong Zhai; Wenguang Chen; Weimin Zheng
For designers of large-scale parallel computers, it is highly desirable to predict the performance of parallel applications at the design phase. However, this is difficult because the execution time of a parallel application is determined by several factors, including the sequential computation time in each process, the communication time, and their convolution. Despite previous efforts, accurately and efficiently estimating the sequential computation time of each process remains an open problem for large-scale parallel applications on target machines that do not yet exist. This paper proposes a novel approach to predict the sequential computation time accurately and efficiently. We assume that at least one node of the target platform is available, but that the whole target system need not be. We make two main technical contributions. First, we employ deterministic replay techniques to execute any process of a parallel application on a single node at real speed. As a result, we can simply measure the real sequential computation time on a target node for each process, one by one. Second, we observe that the computation behavior of the processes in a parallel application can be clustered into a few groups, with the processes in each group exhibiting similar behavior. This observation helps us reduce measurement time significantly, because we only need to execute representative parallel processes instead of all of them. We have implemented a performance prediction framework, called PHANTOM, which integrates the above computation-time acquisition approach with a trace-driven network simulator. We validate our approach on several platforms. For ASCI Sweep3D, the error of our approach is less than 5% on 1,024 processor cores. Compared to a recent regression-based prediction approach, PHANTOM achieves better prediction accuracy across different platforms.
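To make the representative-replay idea concrete, here is a minimal sketch of the clustering step: processes whose per-interval computation-time vectors look alike are grouped, and only one member of each group would need real-speed replay on a target node. The greedy grouping scheme, the distance threshold, and the toy trace are illustrative assumptions, not PHANTOM's actual algorithm.

```python
# Minimal sketch of grouping processes with similar computation behavior.
# Threshold and greedy scheme are assumptions for illustration only.
import math

def distance(a, b):
    """Euclidean distance between two per-process computation-time vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group_processes(vectors, threshold=0.05):
    """Greedily assign each process to the first group whose representative
    is within `threshold`; otherwise start a new group."""
    groups = []  # list of (representative_vector, [member ranks])
    for rank, vec in enumerate(vectors):
        for rep, members in groups:
            if distance(rep, vec) <= threshold:
                members.append(rank)
                break
        else:
            groups.append((vec, [rank]))
    return groups

# Toy trace: computation-time vectors (seconds) for 6 MPI ranks. Ranks 0-2
# and ranks 3-5 behave alike, so only two representatives need replay.
trace = [
    [1.00, 2.01, 0.50], [1.01, 2.00, 0.51], [0.99, 2.02, 0.50],
    [3.00, 0.10, 0.10], [3.02, 0.11, 0.09], [2.99, 0.10, 0.11],
]
for rep, members in group_processes(trace):
    print("replay rank %d for group %s" % (members[0], members))
```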
IEEE International Conference on High Performance Computing Data and Analytics | 2011
Yan Zhai; Mingliang Liu; Jidong Zhai; Xiaosong Ma; Wenguang Chen
The emergence of cloud services brings new possibilities for constructing and using HPC platforms. However, while cloud services provide the flexibility and convenience of customized, pay-as-you-go parallel computing, multiple studies in the past three years have indicated that cloud-based clusters need a significant performance boost to become a competitive choice, especially for tightly coupled parallel applications. In this work, we examine the feasibility of running HPC applications in clouds. This study distinguishes itself from existing investigations in several ways: 1) We carry out a comprehensive examination of issues relevant to the HPC community, including performance, cost, user experience, and range of user activities. 2) We compare an Amazon EC2-based platform built upon its newly available HPC-oriented virtual machines with typical local cluster and supercomputer options, using benchmarks and applications with scales and problem sizes unprecedented in previous cloud HPC studies. 3) We perform detailed performance and scalability analysis to locate the chief limiting factors of state-of-the-art cloud-based clusters. 4) We present a case study on the impact of per-application parallel I/O system configuration, uniquely enabled by cloud services. Our results reveal that although the scalability of EC2-based virtual clusters still lags behind traditional HPC alternatives, they are rapidly gaining in overall performance and cost-effectiveness, making them feasible candidates for tightly coupled scientific computing. In addition, our detailed benchmarking and profiling reveals and analyzes several problems with performance and performance stability on EC2.
IEEE International Conference on High Performance Computing Data and Analytics | 2013
Shuangcheng Niu; Jidong Zhai; Xiaosong Ma; Xiongchao Tang; Wenguang Chen
Recent studies have found cloud environments increasingly appealing for executing HPC applications, including tightly coupled parallel simulations. While public clouds offer elastic, on-demand resource provisioning and pay-as-you-go pricing, individual users setting up their own on-demand virtual clusters may not be able to take full advantage of common cost-saving opportunities, such as reserved instances. In this paper, we propose a Semi-Elastic Cluster (SEC) computing model for organizations to reserve and dynamically resize a virtual cloud-based cluster. We present a set of integrated batch scheduling and resource scaling strategies uniquely enabled by SEC, as well as an online reserved-instance provisioning algorithm based on job history. Our trace-driven simulation results show that such a model achieves a 61.0% cost saving over individual users acquiring and managing cloud resources, without increasing average job wait time. Meanwhile, the overhead of acquiring/maintaining shared cloud instances is shown to take only a few seconds.
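The reserved-instance side of SEC rests on a classic break-even observation: an instance slot is worth reserving if it is busy for a larger fraction of time than the reserved-to-on-demand price ratio. The sketch below illustrates that rule on a made-up demand trace; the prices are hypothetical, and this simple rule stands in for the paper's history-based online provisioning algorithm.

```python
# Break-even sketch for reserved-vs-on-demand provisioning.
# Prices and the demand trace are made-up numbers for illustration.

ON_DEMAND_PER_HOUR = 1.30           # $/hour, hypothetical
RESERVED_EFFECTIVE_PER_HOUR = 0.45  # $/hour amortized over the term

def reserved_count(hourly_demand):
    """How many instances to reserve, given a demand trace
    (number of busy instances sampled each hour)."""
    hours = len(hourly_demand)
    ratio = RESERVED_EFFECTIVE_PER_HOUR / ON_DEMAND_PER_HOUR
    k = 0
    while True:
        # Fraction of hours in which at least k+1 instances were busy.
        busy_fraction = sum(1 for d in hourly_demand if d > k) / hours
        if busy_fraction <= ratio:
            return k
        k += 1

demand = [0, 2, 5, 8, 8, 7, 6, 3, 1, 0, 4, 9, 9, 8, 5, 2] * 10  # toy trace
print("reserve", reserved_count(demand), "instances")
```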
IEEE International Conference on High Performance Computing Data and Analytics | 2009
Jidong Zhai; Tianwei Sheng; Jiangzhou He; Wenguang Chen; Weimin Zheng
A proper understanding of the communication patterns of parallel applications is important for optimizing application performance and designing better communication subsystems. Communication patterns can be obtained by analyzing communication traces. However, existing approaches to generating communication traces need to execute the entire parallel application on a full-scale system, which is time-consuming and expensive. In this paper, we propose a novel technique, called Fact, which can perform FAst Communication Trace collection for large-scale parallel applications on small-scale systems. Our idea is to reduce the original program to a program slice through static analysis, and to execute the program slice to acquire the communication traces. The program slice preserves all the variables and statements in the original program relevant to spatial and volume communication attributes. Our idea is based on the observation that most computation and message contents in message-passing parallel applications are independent of these attributes, and can therefore be removed from the programs for the purpose of communication trace collection. We have implemented Fact and evaluated it with the NPB programs and Sweep3D. The results show that Fact can preserve the spatial and volume communication attributes of the original programs and reduce resource consumption by two orders of magnitude in most cases. For example, Fact collects the communication traces of Sweep3D for 512 processes on a 4-node (32-core) platform in just 6.79 seconds, consuming 1.25 GB of memory, while the original program takes 256.63 seconds and consumes 213.83 GB of memory on a 32-node (512-core) platform. Finally, we present an application of Fact.
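To illustrate the slicing idea, the sketch below computes a backward slice over a hand-made dependence graph: only the statements that the send's destination and message size transitively depend on are kept, while the heavy computation producing the message contents falls out. Fact performs this on real MPI programs via static analysis; the graph and statements here are purely illustrative.

```python
# Backward slice over a toy def-use dependence graph, in the spirit of Fact.
# statement -> set of statements it depends on (hand-made, illustrative)
deps = {
    "s1: n = read_input()":          set(),
    "s2: dest = rank ^ 1":           set(),
    "s3: size = n * 8":              {"s1: n = read_input()"},
    "s4: buf = heavy_compute(n)":    {"s1: n = read_input()"},
    "s5: MPI_Send(buf, size, dest)": {"s2: dest = rank ^ 1",
                                      "s3: size = n * 8"},
}

def backward_slice(targets):
    """Transitive closure of dependencies from the target statements."""
    kept, work = set(), list(targets)
    while work:
        stmt = work.pop()
        if stmt not in kept:
            kept.add(stmt)
            work.extend(deps[stmt])
    return kept

# Slice on the send's spatial/volume attributes: s4 (the heavy computation
# of the message *contents*) falls out, which is exactly what Fact exploits.
for stmt in sorted(backward_slice({"s2: dest = rank ^ 1",
                                   "s3: size = n * 8"})):
    print(stmt)
```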
European Conference on Parallel Processing | 2009
Jin Zhang; Jidong Zhai; Wenguang Chen; Weimin Zheng
Due to the non-uniform communication costs in modern parallel computers, mapping virtual parallel processes to physical processors (or cores) in an optimized way is an important problem for achieving scalable performance. Existing work uses profile-guided approaches to automatically optimize mapping schemes that minimize the cost of point-to-point communications. However, these approaches cannot deal with collective communications and may produce sub-optimal mappings for applications that use them. In this paper, we propose an approach called OPP (Optimized Process Placement) that handles collective communications by transforming them into a series of point-to-point communication operations, according to how the collectives are implemented in the communication library. We can then use existing approaches to find mapping schemes that are optimized for both point-to-point and collective communications. We evaluated our approach with micro-benchmarks covering all MPI collective communications, the NAS Parallel Benchmark suite, and three other applications. Experimental results show that the optimized process placement generated by our approach can achieve significant speedup.
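The following sketch illustrates OPP's core transformation on one example: expanding an allreduce into the point-to-point messages a recursive-doubling implementation would generate, then folding them into a traffic matrix suitable for an existing mapping tool. Recursive doubling is a common MPI_Allreduce algorithm for power-of-two process counts, but real libraries select algorithms by message size and process count, so treat this particular choice as an assumption.

```python
# Expand a collective into point-to-point messages and build a traffic matrix.
# Recursive doubling is assumed here; real MPI libraries pick per-case.

def allreduce_recursive_doubling(nprocs, msg_bytes):
    """Yield (src, dst, bytes) for a recursive-doubling allreduce."""
    assert nprocs & (nprocs - 1) == 0, "power-of-two ranks only in this sketch"
    step = 1
    while step < nprocs:
        for rank in range(nprocs):
            yield rank, rank ^ step, msg_bytes  # pairwise exchange partner
        step <<= 1

def traffic_matrix(nprocs, messages):
    """Fold (src, dst, bytes) triples into an nprocs x nprocs byte matrix."""
    m = [[0] * nprocs for _ in range(nprocs)]
    for src, dst, size in messages:
        m[src][dst] += size
    return m

m = traffic_matrix(8, allreduce_recursive_doubling(8, 1024))
for row in m:
    print(row)
# This matrix can now be merged with the point-to-point traffic and handed
# to an existing profile-guided mapping tool, which is OPP's key idea.
```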
Science in China Series F: Information Sciences | 2009
Wenguang Chen; Jidong Zhai; Jin Zhang; Weimin Zheng
Message Passing Interface (MPI) is the de facto standard for writing parallel scientific applications on distributed memory systems. Performance prediction of MPI programs on current or future parallel systems can help to find system bottlenecks or optimize programs. To effectively analyze and predict the performance of a large and complex MPI program, an efficient and accurate communication model is needed. A series of communication models have been proposed, such as the LogP model family, which assume that the sending overhead, message transmission, and receiving overhead of a communication are not overlapped and that there is a maximum degree of overlap between computation and communication. However, this assumption does not always hold for MPI programs, because the sending or receiving overhead introduced by MPI implementations can decrease the potential overlap for large messages. In this paper, we present a new communication model, named LogGPO, which captures the potential overlap between computation and communication in MPI programs. We design and implement a trace-driven simulator to verify the LogGPO model by predicting the performance of point-to-point communication and two real applications, CG and Sweep3D. The average prediction errors of the LogGPO model are 2.4% and 2.0% for these two applications respectively, while the average prediction errors of the LogGP model are 38.3% and 9.1% respectively.
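The toy calculation below contrasts the assumption LogGPO relaxes: an optimistic LogGP-style model lets a non-blocking send overlap almost fully with computation, whereas a rendezvous protocol for large messages can hold the sender's CPU through the transfer and shrink the overlap window. The parameter values, the eager/rendezvous cutoff, and the formulas are simplified illustrations, not the paper's model.

```python
# Simplified numeric contrast of overlap assumptions; all values hypothetical.

L = 5e-6                 # wire latency (s)
o = 2e-6                 # per-message CPU overhead (s)
G = 1e-9                 # gap per byte (s/byte)
EAGER_LIMIT = 64 * 1024  # bytes; above this, assume a rendezvous protocol

def transfer(m):
    """LogGP end-to-end time for an m-byte message: o + L + (m - 1)G + o."""
    return L + 2 * o + (m - 1) * G

def total_time_max_overlap(m, compute):
    """Optimistic model: computation hides everything but CPU overheads."""
    return 2 * o + max(compute, transfer(m) - 2 * o)

def total_time_rendezvous(m, compute):
    """Pessimistic large-message model: the sender's CPU is held through
    the transfer, so computation cannot overlap it at all."""
    if m <= EAGER_LIMIT:
        return total_time_max_overlap(m, compute)
    return transfer(m) + compute

m, compute = 1 << 20, 500e-6  # 1 MiB message, 500 us of computation
print("max-overlap model: %.1f us" % (total_time_max_overlap(m, compute) * 1e6))
print("rendezvous model:  %.1f us" % (total_time_rendezvous(m, compute) * 1e6))
```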
IEEE Transactions on Parallel and Distributed Systems | 2011
Jidong Zhai; Tianwei Sheng; Jiangzhou He; Wenguang Chen; Weimin Zheng
The communication patterns of parallel applications are important for optimizing application performance and designing better communication subsystems. Communication patterns can be extracted from communication traces. However, existing approaches to generating communication traces need to execute the entire parallel application on a full-scale system, which is time-consuming and expensive. We propose a novel technique, called Fact, which can perform FAst Communication Trace collection for large-scale parallel applications on small-scale systems. Our idea is to reduce the original program to a program slice through static analysis, and to execute the program slice to acquire the communication traces. This is based on the observation that most computation and message contents in parallel applications are not relevant to their spatial and volume communication attributes, and can therefore be removed for the purpose of communication trace collection. We have implemented Fact and evaluated it with the NPB programs and Sweep3D. The results show that Fact can reduce resource consumption by two orders of magnitude in most cases.
Job Scheduling Strategies for Parallel Processing | 2012
Shuangcheng Niu; Jidong Zhai; Xiaosong Ma; Mingliang Liu; Yan Zhai; Wenguang Chen; Weimin Zheng
The FCFS-based backfill algorithm is widely used in scheduling high-performance computer systems. The algorithm relies on job runtime estimates provided by users. However, statistics show that the accuracy of user-provided estimates is poor: users are very likely to provide a runtime estimate much longer than the job's actual execution time.
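For context, here is a compact sketch of EASY backfilling, the FCFS-based scheme in question: the blocked head-of-queue job receives a reservation computed from user-provided runtime estimates, and later jobs may jump ahead only if they cannot delay that reservation. Inflated estimates shrink the apparent backfill window, which is exactly the inaccuracy the paper measures. Job and node counts below are toy values; all times are relative to "now".

```python
# Sketch of EASY backfilling driven by user-provided runtime estimates.

def easy_backfill(queue, free_nodes, running):
    """queue: list of (job_id, nodes, est_runtime); running: list of
    (est_finish_time, nodes). Returns the job ids started now."""
    started = []
    queue = list(queue)
    while queue and queue[0][1] <= free_nodes:  # plain FCFS starts
        job_id, nodes, _ = queue.pop(0)
        free_nodes -= nodes
        started.append(job_id)
    if not queue:
        return started
    # Reservation for the blocked head job: when are enough nodes free?
    head_nodes, avail = queue[0][1], free_nodes
    shadow_time, extra = None, 0
    for finish, nodes in sorted(running):
        avail += nodes
        if avail >= head_nodes:
            shadow_time, extra = finish, avail - head_nodes
            break
    if shadow_time is None:
        return started
    # Backfill: a later job may start if it fits now and either finishes
    # before the reservation or only uses nodes spare at the reservation.
    for job_id, nodes, est in queue[1:]:
        if nodes <= free_nodes and (est <= shadow_time or nodes <= extra):
            free_nodes -= nodes
            if est > shadow_time:  # runs past the reservation: uses spares
                extra -= nodes
            started.append(job_id)
    return started

# 4 free nodes; head job A needs 8 and waits for a running job ending at
# t=5. Short job B (est 3 < 5) backfills safely; C does not fit.
print(easy_backfill([("A", 8, 10), ("B", 4, 3), ("C", 6, 9)], 4, [(5, 6)]))
```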
Asia Pacific Workshop on Systems | 2011
Mingliang Liu; Jidong Zhai; Yan Zhai; Xiaosong Ma; Wenguang Chen
There is a trend to migrate HPC (High Performance Computing) applications to cloud platforms, such as the Amazon EC2 Cluster Compute Instances (CCIs). While existing research has mainly focused on the performance impact of virtualized environments and interconnect technologies on parallel programs, we suggest that the configurability enabled by clouds is another important dimension to explore. Unlike on traditional HPC platforms, on a cloud-resident virtual cluster it is easy to change the I/O configuration, such as the choice of file system, the number of I/O nodes, and the types of virtual disks, to fit the I/O requirements of different applications. In this paper, we discuss how cloud platforms can be employed to form customized and balanced I/O subsystems for individual I/O-intensive MPI applications. Through a preliminary evaluation, we demonstrate that different applications benefit from individually tailored I/O system configurations. For a given I/O-intensive application, different I/O settings may lead to significant differences in overall application performance or cost (up to 2.5-fold). Our exploration indicates that customized system configuration for HPC applications in the cloud is important and non-trivial.
Symposium on Code Generation and Optimization | 2017
Feng Zhang; Bo Wu; Jidong Zhai; Bingsheng He; Wenguang Chen
The integrated architecture that features both a CPU and a GPU on the same die is an emerging and promising platform for fine-grained CPU-GPU collaboration. However, the integration also brings forward several programming and system optimization challenges, especially for irregular applications. The complex interplay between heterogeneity and irregularity leads to very low processor utilization when running irregular applications on integrated architectures. Furthermore, fine-grained co-processing on the CPU and GPU remains an open problem. In particular, we show in this paper that previous workload partitioning for CPU-GPU co-processing is far from ideal in terms of resource utilization and performance. To solve this problem, we propose a software system named FinePar, which considers the architectural differences between the CPU and GPU and leverages the fine-grained collaboration enabled by integrated architectures. Through irregularity-aware performance modeling and online auto-tuning, FinePar partitions irregular workloads and achieves both device-level and thread-level load balance. We evaluate FinePar with 8 irregular applications on an AMD integrated architecture and compare it with state-of-the-art partitioning approaches. Results show that FinePar demonstrates better resource utilization and achieves an average 1.38X speedup over the optimal coarse-grained partitioning method.
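As a toy analogue of irregularity-aware partitioning, the sketch below splits the rows of a sparse workload between the CPU and GPU by row length and tunes the split point with a brute-force sweep: the devices run concurrently, so the slower one dominates. The two throughput constants are hypothetical stand-ins for FinePar's measured, auto-tuned performance models.

```python
# Toy irregularity-aware CPU/GPU partitioning by row length.
# Cost constants are hypothetical, not measured values from the paper.

CPU_NS_PER_NNZ = 1.0   # per-nonzero cost on the CPU (short, skewed rows)
GPU_NS_PER_NNZ = 0.25  # per-nonzero cost on the GPU (long, regular rows)

def partition_cost(row_lengths, threshold):
    """Predicted time if rows shorter than `threshold` go to the CPU and
    the rest to the GPU; devices run concurrently, so total = max."""
    cpu = sum(l for l in row_lengths if l < threshold) * CPU_NS_PER_NNZ
    gpu = sum(l for l in row_lengths if l >= threshold) * GPU_NS_PER_NNZ
    return max(cpu, gpu)

def tune_threshold(row_lengths):
    """Online auto-tuning reduced to a sweep over candidate thresholds."""
    candidates = sorted(set(row_lengths)) + [max(row_lengths) + 1]
    return min(candidates, key=lambda t: partition_cost(row_lengths, t))

rows = [1, 2, 2, 3, 128, 130, 127, 4, 256, 2, 1, 129]  # skewed row lengths
best = tune_threshold(rows)
print("row-length threshold:", best,
      "predicted time (ns):", partition_cost(rows, best))
```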