Ruini Xue
University of Electronic Science and Technology of China
Publications
Featured research published by Ruini Xue.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2009
Ruini Xue; Xuezheng Liu; Ming Wu; Zhengyu Guo; Wenguang Chen; Weimin Zheng; Zheng Zhang; Geoffrey M. Voelker
Message Passing Interface (MPI) is a widely used standard for managing coarse-grained concurrency on distributed computers. Debugging parallel MPI applications, however, has always been a particularly challenging task due to their high degree of concurrent execution and non-deterministic behavior. Deterministic replay is a potentially powerful technique for addressing these challenges, with existing MPI replay tools adopting either data-replay or order-replay approaches. Unfortunately, each approach has its tradeoffs. Data-replay generates substantial log sizes by recording every communication message. Order-replay generates small logs, but requires all processes to be replayed together. We believe that these drawbacks are the primary reasons that inhibit the wide adoption of deterministic replay as the critical enabler of cyclic debugging of MPI applications. This paper describes subgroup reproducible replay (SRR), a hybrid deterministic replay method that provides the benefits of both data-replay and order-replay while balancing their trade-offs. SRR divides all processes into disjoint groups. It records the contents of messages crossing group boundaries as in data-replay, but records just message orderings for communication within a group as in order-replay. In this way, SRR can exploit the communication locality of traffic patterns in MPI applications. During replay, developers can then replay each group individually. SRR reduces recording overhead by not recording intra-group communication, and reduces replay overhead by limiting the size of each replay group. Exposing these tradeoffs gives the user the necessary control for making deterministic replay practical for MPI applications. We have implemented a prototype, MPIWiz, to demonstrate and evaluate SRR. MPIWiz employs a replay framework that allows transparent binary instrumentation of both library and system calls. As a result, MPIWiz replays MPI applications with no source code modification and relinking, and handles non-determinism in both MPI and OS system calls. Our preliminary results show that MPIWiz can reduce recording overhead by over a factor of four relative to data-replay, yet without requiring the entire application to be replayed as in order-replay. Recording increases execution time by 27% while the application can be replayed in just 53% of its base execution time.
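The core recording rule of SRR can be illustrated with a short sketch: log full payloads only for messages that cross a group boundary, and log only delivery order inside a group. The group assignment, log layout, and method names below are illustrative assumptions; the actual MPIWiz prototype implements this transparently through binary instrumentation of MPI and OS calls.

```python
# Minimal sketch of the SRR recording decision, under assumed names and log formats.
from dataclasses import dataclass, field

@dataclass
class SRRRecorder:
    group_of: dict                                  # MPI rank -> replay group id
    data_log: list = field(default_factory=list)    # payloads of cross-group messages
    order_log: list = field(default_factory=list)   # orderings of intra-group messages

    def on_receive(self, src_rank, dst_rank, payload, seq):
        if self.group_of[src_rank] != self.group_of[dst_rank]:
            # Data-replay style: the sender is outside the group, so the payload
            # must be logged to replay this group in isolation.
            self.data_log.append((dst_rank, seq, payload))
        else:
            # Order-replay style: only the delivery order is needed, because the
            # sender will be re-executed when the group is replayed.
            self.order_log.append((dst_rank, seq, src_rank))

# Example: 4 ranks split into two groups of two.
rec = SRRRecorder(group_of={0: 0, 1: 0, 2: 1, 3: 1})
rec.on_receive(src_rank=2, dst_rank=0, payload=b"halo", seq=1)  # cross-group -> data log
rec.on_receive(src_rank=1, dst_rank=0, payload=b"halo", seq=2)  # intra-group -> order log
```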
IEEE Transactions on Services Computing | 2015
Yong Zhao; Youfu Li; Ioan Raicu; Shiyong Lu; Cui Lin; Yanzhe Zhang; Wenhong Tian; Ruini Xue
Cloud computing is an emerging computing paradigm that can offer unprecedented scalability and resources on demand, and it is gaining increasing adoption in the science community. Scientific workflow management systems, in turn, provide essential support to scientific computing, such as management of data and task dependencies, job scheduling and execution, and provenance tracking. As we enter a "big data" era, it is imperative to migrate scientific workflow management systems into the cloud to manage the ever-increasing data scale and analysis complexity. We propose a reference service framework for integrating scientific workflow management systems into various cloud platforms; it consists of eight major components, including the Cloud Workflow Management Service and the Cloud Resource Manager, and six interfaces between them. We also present a reference framework for the implementation of the Cloud Resource Manager, which is responsible for provisioning and managing virtual resources in the cloud. We discuss our implementation of the framework by integrating the Swift scientific workflow management system with the OpenNebula and Eucalyptus cloud platforms, and demonstrate the capability of the solution using a NASA MODIS image processing workflow and a production deployment on the Science@Guoshi network with support for the Montage image mosaic workflow.
Grid Computing | 2012
Yong Zhao; Yanzhe Zhang; Wenhong Tian; Ruini Xue; Cui Lin
Scientific applications are growing rapidly in both data scale and processing complexity due to advances in science instrumentation and network technologies. Cloud computing, an emerging computing paradigm that can offer unprecedented scalability and resources on demand, is gaining increasing adoption in the science community. We present our early effort in designing and building CloudDragon, a scientific computing Cloud platform based on OpenNebula. We take a structured approach that integrates client-side application specification and testing, service-based workflow submission and management, on-demand virtual cluster provisioning, high-throughput task scheduling and execution, and efficient and scalable Cloud resource management. We first analyze the integration efficiency of our approach in a cluster setting and then show a production deployment of the platform.
International Conference on Cloud and Green Computing | 2012
Yong Zhao; Youfu Li; Wenhong Tian; Ruini Xue
Scientific workflow management systems have been around for many years and provide essential support to scientific computing, such as management of data and task dependencies, job scheduling and execution, and provenance tracking. As we enter a "big data" era, it is necessary for scientific workflow systems to integrate with Cloud platforms to deal with the ever-increasing data scale and analysis complexity. In this paper, we present our experience in offering the Swift scientific workflow management system as a service in the Cloud. Our solution integrates Swift with the OpenNebula Cloud platform, and supports workflow specification and submission, on-demand virtual cluster provisioning, high-throughput task scheduling and execution, and efficient and scalable Cloud resource management. We demonstrate the capability of the solution using a NASA MODIS image processing workflow.
International Journal of Distributed Sensor Networks | 2013
Wenhong Tian; Ruini Xue; Xu Dong; Haoyan Wang
With the development of the Internet of Things, the number of radio frequency identification (RFID) network readers and tags is increasing rapidly. Large-scale application of RFID networks requires that the RFID middleware system process a large amount of data efficiently, with load balancing, efficient redundant data elimination, and Web service capabilities, so that the required information can be transmitted to applications with low overhead and transparency. In view of these objectives, and taking particular advantage of the virtualization and transparency of Cloud computing, this paper introduces an advanced RFID middleware management system (ARMMS) over Cloud computing, which provides equipment management, RFID tag information management, application level event (ALE) management, Web service APIs, and related functions. ARMMS differs from existing RFID middleware in its distributed design, data filtering, integrated load balancing, and Web service APIs, all of which are designed over the Cloud. The distributed architecture supports large-scale applications; the integrated load-balancing strategy guarantees the stability and high performance of the system; the layered, parallel redundant-data elimination scheme ensures that the needed information is transmitted to the application level with low overhead; and the Web service APIs support cross-platform information processing with transparency to the lower-level RFID hardware.
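One middleware duty mentioned above, redundant data elimination, can be illustrated with a simple time-window filter that drops repeated reads of the same tag by the same reader. This is only an illustrative single-node sketch; ARMMS's actual scheme is layered and parallel, and the window length and data layout here are assumptions.

```python
# Illustrative window-based redundant-read filter (assumed parameters).
import time

class TagReadFilter:
    def __init__(self, window_seconds=2.0):
        self.window = window_seconds
        self.last_seen = {}  # (reader_id, tag_id) -> timestamp of last forwarded read

    def accept(self, reader_id, tag_id, now=None):
        """Return True if this read should be forwarded to the application level."""
        now = time.time() if now is None else now
        key = (reader_id, tag_id)
        if now - self.last_seen.get(key, float("-inf")) < self.window:
            return False           # duplicate within the window: drop it
        self.last_seen[key] = now  # first sighting (or window expired): forward it
        return True

f = TagReadFilter(window_seconds=2.0)
print(f.accept("reader-1", "tag-42", now=0.0))   # True  (new tag)
print(f.accept("reader-1", "tag-42", now=0.5))   # False (redundant read)
print(f.accept("reader-1", "tag-42", now=3.0))   # True  (window expired)
```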
Cyber-Enabled Distributed Computing and Knowledge Discovery | 2014
Ruini Xue; Lixiang Ao; Shengli Gao; Zhongyang Guan; Lupeng Lian
The Hadoop Distributed File System (HDFS) provides scalable, high-throughput, and fault-tolerant file service for Hadoop distributed frameworks. However, its single-NameNode architecture makes the metadata server a performance bottleneck and a single point of failure (SPOF). In this study, we designed and implemented Partitioner, a distributed HDFS metadata server cluster that unites multiple NameNodes and provides a single unified namespace for clients. To support distributed metadata management, a novel subtree-based metadata partitioning and lookup scheme is proposed. Experimental results indicated that this method can eliminate the SPOF while expanding metadata capacity, resulting in improved scalability and availability of the distributed file system.
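The lookup side of a subtree-based partitioning scheme can be pictured as routing each path to the metadata server that owns its longest matching directory subtree. The routing table, server names, and class below are assumptions for illustration; the abstract does not spell out Partitioner's actual data structures.

```python
# Sketch of subtree-based metadata lookup, assuming a static subtree-to-server table.
import posixpath

class SubtreeRouter:
    def __init__(self, subtree_map):
        # subtree_map example: {"/": "nn0", "/user": "nn1", "/user/alice/logs": "nn2"}
        self.subtree_map = subtree_map

    def namenode_for(self, path):
        """Route a metadata operation by the longest subtree prefix that owns it."""
        p = posixpath.normpath(path)
        while True:
            if p in self.subtree_map:
                return self.subtree_map[p]
            if p == "/":
                raise KeyError("no subtree owns " + path)
            p = posixpath.dirname(p)  # walk up toward the root

router = SubtreeRouter({"/": "nn0", "/user": "nn1", "/user/alice/logs": "nn2"})
print(router.namenode_for("/user/alice/logs/app.log"))  # nn2
print(router.namenode_for("/user/bob/data.csv"))        # nn1
print(router.namenode_for("/tmp/scratch"))              # nn0
```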
International Conference on Parallel Processing | 2005
Youhui Zhang; Ruini Xue; Dongsheng Wong; Weimin Zheng
As clusters continue to grow in size and popularity, issues of fault tolerance and reliability become limiting factors on application scalability and system availability. To address these issues, we design and implement ChaRM64, a high-availability parallel runtime system providing checkpoint-based rollback recovery and migration for MPI programs on clusters of IA-64 computers. Our approach integrates MPICH with a user-level, single-process checkpoint/recovery library for IA-64 Linux, and modifies the P4 library to implement coordinated checkpointing and rollback recovery (CRR) and a migration mechanism for parallel applications. In addition, CRR of file operations is supported. Testing shows that the CRR mechanism introduces negligible performance overhead in our implementation.
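The coordinated part of the CRR mechanism can be sketched as a three-phase protocol: quiesce and drain communication, write local checkpoints, then commit and resume. The simulation below is only a stand-in for illustration; ChaRM64 implements this inside MPICH/P4 with a user-level single-process checkpoint library, and the class and method names here are assumed.

```python
# Simulation-only sketch of a coordinated checkpoint protocol (assumed names).
class WorkerStub:
    """Stands in for one MPI process; it only records protocol steps."""
    def __init__(self, rank):
        self.rank = rank
        self.log = []

    def step(self, action, epoch=None):
        self.log.append((action, epoch))

class CheckpointCoordinator:
    def __init__(self, workers):
        self.workers = workers

    def take_global_checkpoint(self, epoch):
        # Phase 1: quiesce communication so the local checkpoints
        # together form a consistent global state.
        for w in self.workers:
            w.step("quiesce")
        # Phase 2: each process writes its local checkpoint (memory image
        # plus file-operation state, which ChaRM64 also covers).
        for w in self.workers:
            w.step("checkpoint", epoch)
        # Phase 3: commit the epoch only after all checkpoints are durable,
        # then let computation resume.
        for w in self.workers:
            w.step("commit", epoch)
            w.step("resume")

workers = [WorkerStub(r) for r in range(4)]
CheckpointCoordinator(workers).take_global_checkpoint(epoch=1)
print(workers[0].log)  # [('quiesce', None), ('checkpoint', 1), ('commit', 1), ('resume', None)]
```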
International Symposium on Parallel and Distributed Computing | 2015
Ruini Xue; Shengli Gao; Lixiang Ao; Zhongyang Guan
Task scheduling is critical to reducing the makespan of MapReduce jobs. An effective approach to scheduling optimization is improving data locality, that is, attempting to place a task and its related data block on the same node. However, recent studies have been insufficient in addressing the locality issue. This paper proposes BOLAS, a MapReduce task scheduling algorithm that models the scheduling process as a bipartite-graph matching problem and strives to assign each data block to the nearest task. By considering both the divergence of node performance and the distribution of data blocks in a MapReduce cluster, BOLAS can achieve a high degree of data locality, guarantee minimal data transfer during execution, and consequently reduce a job's makespan. As a dynamic algorithm, BOLAS solves the model using the Kuhn-Munkres optimal matching algorithm, and can be deployed in either homogeneous or heterogeneous environments. In this study, BOLAS was implemented as a plug-in for Hadoop, and the experimental results indicate that BOLAS can localize nearly 100% of the map tasks and reduce the execution time by up to 67.1%.
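The matching step can be reconstructed in miniature: build a task-by-node cost matrix where lower cost means the node is closer to the task's input block, then solve it with the Hungarian (Kuhn-Munkres) algorithm. The cost values below are placeholders; BOLAS's real costs also weigh node performance and block distribution.

```python
# Toy Kuhn-Munkres matching for locality-aware task placement (placeholder costs).
import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j]: distance from task i's input block to node j
# (0 = block is local, 1 = same rack, 2 = remote rack)
cost = np.array([
    [0, 1, 2],   # task 0's block lives on node 0
    [2, 0, 1],   # task 1's block lives on node 1
    [1, 2, 0],   # task 2's block lives on node 2
])

task_idx, node_idx = linear_sum_assignment(cost)  # minimum-cost perfect matching
for t, n in zip(task_idx, node_idx):
    print(f"schedule task {t} on node {n} (cost {cost[t, n]})")
# With this matrix every task lands on the node holding its block (cost 0),
# i.e. full data locality for the map tasks.
```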
IEEE International Conference on Cloud Computing Technology and Science | 2014
Yong Zhao; Youfu Li; Ioan Raicu; Cui Lin; Wenhong Tian; Ruini Xue
Cloud computing is an emerging computing paradigm that can offer unprecedented scalability and resources on demand, and it is gaining significant adoption in the science community. At the same time, scientific workflow management systems provide essential support and functionality to scientific computing, such as management of data and task dependencies, job scheduling and execution, provenance tracking, and fault tolerance. Migrating scientific workflow management systems from traditional Grid computing environments into the Cloud would enable a much broader user base to conduct their scientific research with ever-increasing data scale and analysis complexity. This paper presents our experience in integrating the Swift scientific workflow management system with the OpenNebula Cloud platform, which supports workflow specification and submission, on-demand virtual cluster provisioning, high-throughput task scheduling and execution, and efficient and scalable resource management in the Cloud. We set up a series of experiments to demonstrate the capability of our integration and use a MODIS image processing workflow as a showcase of the implementation.
Computational Science and Engineering | 2014
Ruini Xue; Zhongyang Guan; Shengli Gao; Lixiang Ao
As a distributed MapReduce framework, Hadoop has been widely adopted in big data processing, with HDFS (Hadoop Distributed File System) mostly used for its data storage. Although the single-master architecture of HDFS simplifies the design and implementation, it suffers from issues such as a single point of failure (SPOF) and limited scalability, which may further become a performance bottleneck. To address these problems, this paper proposes NM2H, a NoSQL-based metadata management approach for HDFS. NM2H separates the storage and query of metadata, in contrast to the traditional architecture that mixes them, and keeps the interfaces among the metadata service, DataNodes, and clients unchanged through a novel mapping from the original metadata structures to NoSQL documents. Therefore, the new approach not only takes advantage of NoSQL's better scalability and fault tolerance, but also delivers transparency to client applications, so that existing programs can run on the new architecture without any modification. A prototype of NM2H was designed and implemented with the widely adopted NoSQL system MongoDB. Extensive performance evaluation was conducted, and the experimental results indicated the improvements of NM2H while the introduced overhead remained acceptable.
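The flavor of mapping file-system metadata to NoSQL documents can be sketched with MongoDB, in the spirit of NM2H. The document schema, collection names, and helper functions below are assumptions for illustration (and assume a MongoDB server at localhost); the abstract does not give the paper's exact mapping.

```python
# Sketch of storing HDFS-style file metadata as MongoDB documents (assumed schema).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
meta = client["hdfs_meta"]["inodes"]
meta.create_index("path", unique=True)   # path lookups replace in-memory namespace maps

def create_file(path, replication=3, block_size=128 * 1024 * 1024):
    """Persist one file's metadata as a single document."""
    meta.insert_one({
        "path": path,
        "type": "file",
        "replication": replication,
        "block_size": block_size,
        "blocks": [],            # filled in as blocks are allocated on DataNodes
    })

def get_file_status(path):
    """A metadata query goes straight to the NoSQL store."""
    return meta.find_one({"path": path}, {"_id": 0})

create_file("/user/alice/data.csv")
print(get_file_status("/user/alice/data.csv"))
```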