Jiangling Yin
University of Central Florida
Publications
Featured research published by Jiangling Yin.
Asia-Pacific Magnetic Recording Conference | 2012
Jun Wang; Qiangju Xiao; Jiangling Yin; Pengju Shang
Recent years have seen an increasing number of scientists employ data-parallel computing frameworks such as MapReduce and Hadoop to run data-intensive applications and conduct analysis. In these co-located compute and storage frameworks, a wise data placement scheme can significantly improve performance. Existing data-parallel frameworks, e.g., Hadoop or Hadoop-based clouds, distribute data using a random placement method for simplicity and load balance. However, we observe that many data-intensive applications exhibit interest locality: they sweep only part of a big data set, and the data that are often accessed together share grouping semantics. Without taking data grouping into consideration, random placement performs well below the efficiency of an optimal data distribution. In this paper, we develop a new Data-gRouping-AWare (DRAW) data placement scheme to address this problem. DRAW dynamically scrutinizes data accesses from system log files, extracts optimal data groupings, and re-organizes data layouts to achieve the maximum parallelism per group subject to load balance. By experimenting with two real-world MapReduce applications under different data placement schemes on a 40-node test bed, we conclude that DRAW increases the total number of local map tasks executed by up to 59.8%, reduces the completion latency of the map phase by up to 41.7%, and improves overall performance by 36.4%, in comparison with Hadoop's default random placement.
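The abstract outlines DRAW's pipeline (mine access logs, extract groupings, re-organize the layout) without giving its algorithms. The sketch below only illustrates the first and last steps under assumed inputs: the log format, the pairwise co-access weighting, and the round-robin placement are illustrative assumptions, not DRAW's implementation, and the clustering of blocks into groups from these weights is omitted.

```python
from collections import defaultdict
from itertools import combinations

def co_access_counts(access_log):
    """access_log: iterable of (task_id, block_id) pairs (assumed format).
    Returns a weight for every pair of blocks accessed by the same task;
    a clustering step (not shown) would turn these weights into groups."""
    blocks_per_task = defaultdict(set)
    for task, block in access_log:
        blocks_per_task[task].add(block)
    weights = defaultdict(int)
    for blocks in blocks_per_task.values():
        for a, b in combinations(sorted(blocks), 2):
            weights[(a, b)] += 1
    return weights

def place_group(group, nodes):
    """Spread one co-access group across distinct nodes to maximize
    per-group parallelism (simple round-robin; load balance not modeled here)."""
    return {block: nodes[i % len(nodes)] for i, block in enumerate(sorted(group))}
```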
International Parallel and Distributed Processing Symposium | 2015
Jiangling Yin; Jun Wang; Jian Zhou; Tyler Lukasiewicz; Dan Huang; Junyao Zhang
In this paper, we study parallel data access on distributed file systems, e.g., the Hadoop file system. Our experiments show that parallel data read requests are often served remotely and in an imbalanced fashion, resulting in serious disk access and data transfer contention on certain cluster/storage nodes. We conduct a complete analysis of how remote and imbalanced read patterns occur and how they are affected by the size of the cluster. We then propose a novel method to Optimize Parallel Data Access on Distributed File Systems, referred to as Opass. The goal of Opass is to reduce remote parallel data accesses and achieve a higher balance of data read requests between cluster nodes. To achieve this goal, we represent the data read requests that parallel applications issue to cluster nodes as a graph in which edge weights encode the demands of data locality and load capacity. We then propose new matching-based algorithms that match processes to data according to this graph so as to achieve the maximum degree of data locality and balanced access. Our proposed method can benefit parallel data-intensive analysis with various parallel data access strategies. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results from both benchmarks and well-known parallel applications show the performance benefits and scalability of Opass.
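The abstract describes Opass as matching processes to data on a weighted graph that encodes locality demands and load capacity, without spelling out the matching algorithms. As a rough illustration only, the sketch below poses that matching as a standard weighted assignment problem; the cost function and the use of `scipy.optimize.linear_sum_assignment` are assumptions, not Opass's algorithms.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_processes_to_blocks(locality, load, alpha=1.0, beta=1.0):
    """locality[p][b] = 1 if block b has a replica local to process p's node.
    load[p] = number of requests already assigned to p's node.
    Builds a cost matrix that rewards locality and penalizes loaded nodes,
    then solves a one-to-one assignment (illustrative stand-in only)."""
    locality = np.asarray(locality, dtype=float)
    load = np.asarray(load, dtype=float)
    cost = -alpha * locality + beta * load[:, None]
    rows, cols = linear_sum_assignment(cost)
    return dict(zip(rows.tolist(), cols.tolist()))  # process index -> block index
```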
Parallel Computing | 2014
Jiangling Yin; Junyao Zhang; Jun Wang; Wu-chun Feng
In order to run tasks in a parallel and load-balanced fashion, existing scientific parallel applications such as mpiBLAST introduce a data-initializing stage to move database fragments from shared storage to local cluster nodes. Unfortunately, with the exponentially increasing size of sequence databases in today's big data era, such an approach is inefficient. In this paper, we develop a scalable data access framework, SDAFT, to solve the data movement problem for scientific applications that are dominated by "read" operations for data analysis. SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. SDAFT consists of two interlocked components: (1) a data-centric load-balanced scheduler (DC-scheduler) to enforce data-process locality and (2) a translation layer to translate conventional parallel I/O operations into HDFS I/O. By experimenting with our SDAFT prototype system on a real-world database and queries across a wide variety of computing platforms, we found that SDAFT can reduce I/O cost by a factor of 4–10 and double the overall execution performance as compared with existing schemes.
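The DC-scheduler's exact policy is not given in the abstract. The sketch below only illustrates the general idea of data-process locality scheduling under assumed inputs; `block_locations` and `proc_node` are hypothetical names, and falling back to the head of the queue when no local fragment exists is an assumption.

```python
from collections import deque

def dc_schedule(pending_fragments, block_locations, proc_node):
    """Illustrative data-centric scheduling pass (not SDAFT's DC-scheduler itself).
    pending_fragments: fragment ids still to be searched.
    block_locations[frag]: set of nodes holding a replica of that fragment.
    proc_node[rank]: node on which MPI rank `rank` runs.
    Assigns each idle rank a fragment with a local replica when one exists."""
    queue = deque(pending_fragments)
    assignment = {}
    for rank, node in proc_node.items():
        local = next((f for f in queue if node in block_locations[f]), None)
        frag = local if local is not None else (queue[0] if queue else None)
        if frag is None:
            break
        queue.remove(frag)
        assignment[rank] = frag
    return assignment
```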
Journal of Parallel and Distributed Computing | 2017
Jun Wang; Xuhong Zhang; Junyao Zhang; Jiangling Yin; Dezhi Han; Ruijun Wang; Dan Huang
During the last few decades, Data-intensive File Systems (DiFS), such as the Google File System (GFS) and the Hadoop Distributed File System (HDFS), have become the key storage architectures for big data processing. These storage systems usually divide files into fixed-size blocks (or chunks). Each block is replicated (usually three-way) and distributed pseudo-randomly across the cluster. The master node (namenode) uses a huge table to record the locations of each block and its replicas. However, with the increasing size of the data, the block location table and its corresponding maintenance can occupy more than half of the memory space and 30% of the processing capacity of the master node, which severely limits its scalability and performance. We argue that physical data distribution and maintenance should be separated from metadata management and performed by each storage node autonomously. In this paper, we propose Deister, a novel block management scheme built on an invertible deterministic declustering distribution method called Intersected Shifted Declustering (ISD). Deister is amenable to current research on scaling the namespace management of the master node. In Deister, the huge table for maintaining block locations in the master node is eliminated, and the maintenance of the block-node mapping is performed autonomously on each data node. Results show that, compared with the HDFS default configuration, Deister achieves identical performance while saving about half of the RAM space and 30% of the processing capacity of the master node, and is expected to scale to double the size of a current single-namenode HDFS cluster, pushing the scalability bottleneck of the master node back to namespace management.
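Deister's key idea is that block locations can be recomputed from an invertible deterministic function instead of being looked up in a namenode table. The toy mapping below illustrates only that property; the modular "shifted" formula is a stand-in assumption and is not the Intersected Shifted Declustering method.

```python
def replica_nodes(block_id: int, n_nodes: int, replicas: int = 3, shift: int = 1):
    """Toy deterministic placement (stand-in for ISD, not its real formula):
    replica i of a block lands on a node computed purely from the block id,
    so no central location table is needed."""
    base = block_id % n_nodes
    return [(base + i * shift) % n_nodes for i in range(replicas)]

def blocks_on_node(node: int, all_block_ids, n_nodes: int, replicas: int = 3, shift: int = 1):
    """Inverse direction: a data node can enumerate which blocks it should host
    by re-evaluating the same deterministic function locally."""
    return [b for b in all_block_ids
            if node in replica_nodes(b, n_nodes, replicas, shift)]
```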
International Conference on Big Data | 2013
Jiangling Yin; Andrew Foran; Jun Wang
Currently, most scientific applications based on MPI adopt a compute-centric architecture: the data they need is accessed by MPI processes running on different nodes through a shared file system. Unfortunately, the explosive growth of scientific data undermines the high performance of MPI-based applications, especially in the execution environment of commodity clusters. In this paper, we present a novel approach, referred to as DL-MPI, to enable data-locality computation for MPI-based data-intensive applications. DL-MPI allows MPI-based programs to obtain data distribution information for compute nodes through a novel data locality API. In addition, the problem of allocating data processing tasks to parallel processes is formulated as an integer optimization problem with the objectives of achieving data-locality computation and optimal parallel execution time. For heterogeneous runtime environments, we propose a probability-based scheduling algorithm that dynamically schedules tasks to processes by evaluating the unprocessed local data and the computing ability of each compute node. We demonstrate the functionality of our methods through the implementation of scientific data processing programs as well as the incorporation of DL-MPI into existing HPC applications.
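The abstract says the heterogeneous-cluster scheduler picks targets probabilistically from unprocessed local data and per-node computing ability, but does not give the formula. The minimal sketch below shows one plausible weighting; the product form and the fallback to a uniform choice are assumptions, not the paper's derivation.

```python
import random

def pick_node(unprocessed_local, compute_ability):
    """Illustrative probabilistic dispatch: weight each node by its unprocessed
    local data volume times its measured computing ability, then sample a node
    with probability proportional to that weight (assumed weighting)."""
    weights = {n: unprocessed_local[n] * compute_ability[n] for n in unprocessed_local}
    total = sum(weights.values())
    if total == 0:
        return random.choice(list(unprocessed_local))  # no local data anywhere
    r, acc = random.uniform(0, total), 0.0
    for node, w in weights.items():
        acc += w
        if r <= acc:
            return node
    return node  # guard against floating-point rounding at the boundary
```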
IEEE Transactions on Big Data | 2018
Jun Wang; Xuhong Zhang; Jiangling Yin; Ruijun Wang; Huafeng Wu; Dezhi Han
In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop file system. Our experiments show that the distribution of sub-datasets over HDFS blocks, which is hidden by HDFS, can often cause the corresponding analyses to suffer from seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets results in some computational nodes carrying out much heavier workloads than others; furthermore, it leads to inefficient sampling of sub-datasets, as analysis programs will often read large amounts of irrelevant data. We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling occur. We then propose a storage-distribution-aware method, referred to as DataNet, to optimize sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap, based on HashMap and BloomFilter techniques, to store the meta-data. Third, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results show the performance benefits of DataNet.
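The abstract names ElasticMap as a combination of HashMap and BloomFilter techniques but does not detail its layout. The sketch below is one plausible reading: exact counts for sub-datasets that dominate a block and only membership bits for the long tail. The threshold, class names, and count semantics are assumptions, not DataNet's design.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter, included only to keep this sketch self-contained."""
    def __init__(self, size=8192, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, bytearray(size)
    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.size
    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1
    def __contains__(self, key):
        return all(self.bits[p] for p in self._positions(key))

class ElasticMapSketch:
    """Hypothetical stand-in for ElasticMap: exact counts for sub-datasets that
    dominate a block, a Bloom filter for the long tail."""
    def __init__(self, heavy_threshold=0.05):
        self.heavy_threshold, self.exact, self.tail = heavy_threshold, {}, BloomFilter()
    def record_block(self, block_id, subdataset_counts):
        total = sum(subdataset_counts.values())
        for sub, cnt in subdataset_counts.items():
            if cnt / total >= self.heavy_threshold:
                self.exact[(block_id, sub)] = cnt   # frequent: keep the exact count
            else:
                self.tail.add((block_id, sub))      # rare: membership only
    def occurrences(self, block_id, sub):
        if (block_id, sub) in self.exact:
            return self.exact[(block_id, sub)]
        return "present, count unknown" if (block_id, sub) in self.tail else None
```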
International Conference on Cloud Computing | 2015
Dan Huang; Jiangling Yin; Jun Wang; Xuhong Zhang; Junyao Zhang; Jian Zhou
Recent years have seen an increasing number of hybrid scientific applications, which often consist of an HPC simulation program along with its corresponding data analytics programs. Unfortunately, current computing platform settings do not accommodate this emerging workflow very well, mainly because HPC simulation programs store output data in a dedicated storage cluster equipped with a Parallel File System (PFS). To perform analytics on data generated by simulation, the data has to be migrated from the storage cluster to the compute cluster. This data migration can introduce severe delay, especially given ever-increasing data sizes. While scale-up supercomputers equipped with dedicated PFS storage clusters still represent mainstream HPC, an ever-increasing number of scale-out small- and medium-sized HPC clusters have been supplied to facilitate hybrid scientific workflow applications in fast-growing cloud computing infrastructures such as Amazon cluster compute instances. Unlike the traditional supercomputer setting, the limited network bandwidth in scale-out HPC clusters makes data migration prohibitively expensive. To attack the problem, we develop a Unified I/O System Framework (UNIO) to avoid such migration overhead for scale-out small- and medium-sized HPC clusters. Our main idea is to enable both HPC simulation programs and analytics programs to run atop one unified file system, e.g., a data-intensive file system (DIFS in brief). In UNIO, an I/O middleware component allows original HPC simulation programs to execute direct I/O operations over DIFS without any porting effort, while an I/O scheduler dynamically smooths out both disk write and read traffic for simulation and analysis programs alike. By experimenting with a real-world scientific workflow over a 46-node UNIO prototype, we found that UNIO achieves read/write I/O performance comparable to that of small- and medium-sized HPC clusters equipped with a parallel file system. More importantly, since UNIO completely avoids the most expensive data movement overhead, it achieves up to 3x speedups for hybrid scientific workflow applications compared with current solutions.
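UNIO's I/O scheduler is said to smooth out write and read traffic between simulation and analysis programs, but the abstract does not specify the policy. The per-node deadline queue below is only a toy stand-in to make the idea of interleaved dispatch concrete; the data structure and the deadline ordering are assumptions.

```python
import heapq

class IOSchedulerSketch:
    """Toy scheduler in the spirit of UNIO's traffic smoothing (not its real policy):
    simulation writes and analysis reads are queued per node and dispatched in
    deadline order so that neither side monopolizes a node's disk."""
    def __init__(self):
        self._queues = {}  # node -> heap of (deadline, seq, request)
        self._seq = 0
    def submit(self, node, request, deadline):
        self._seq += 1
        heapq.heappush(self._queues.setdefault(node, []), (deadline, self._seq, request))
    def next_request(self, node):
        q = self._queues.get(node)
        return heapq.heappop(q)[2] if q else None
```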
Networking, Architecture and Storage | 2014
Jun Wang; Ruijun Wang; Jiangling Yin; Huijun Zhu; Yuanyuan Yang
Reliability is a critical metric in the design and development of scale-out data storage clusters. General multi-way replication-based declustering schemes have been widely used in enterprise large-scale storage systems to improve I/O parallelism. Unfortunately, given an increasing number of node failures, how often a cluster starts losing data as it is scaled out is not well investigated. In this paper, we study the reliability of multi-way declustering layouts by developing an extended model, specifically by abstracting the continuous-time Markov chain into a system of ordinary differential equations, and by analyzing their potential for parallel recovery. Our comprehensive simulation results in Matlab and SHARPE show that the shifted declustering layout outperforms the random declustering layout in a multi-way replication scale-out architecture in terms of data loss probability and system reliability, by up to 63% and 85% respectively. Our study of both 5-year and 10-year system reliability under various recovery bandwidth settings shows that the shifted declustering layout surpasses the random declustering layout in both cases while consuming up to 5.2% and 11% less recovery bandwidth.
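The reliability model abstracts a continuous-time Markov chain into ordinary differential equations. The abstract does not give the state space or rates, but the generic step behind such an abstraction is the Kolmogorov forward equation, sketched below with a hypothetical generator matrix Q whose entries would encode failure and recovery rates.

```latex
% Generic CTMC-to-ODE step (illustrative; the paper's actual states and rates
% for the shifted/random declustering layouts are not given in the abstract).
% Let p_i(t) be the probability of being in reliability state i at time t and
% Q = (q_{ij}) the generator built from failure rates \lambda and repair rates \mu:
\frac{d p_j(t)}{dt} = \sum_{i} p_i(t)\, q_{ij},
\qquad\text{equivalently}\qquad
\frac{d\mathbf{p}(t)}{dt} = \mathbf{p}(t)\, Q, \quad \mathbf{p}(0) = \mathbf{p}_0 .
% The data-loss probability by time T is then the probability accumulated in the
% absorbing "data lost" state: P_{\mathrm{loss}}(T) = p_{\mathrm{loss}}(T).
```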
Journal of Parallel and Distributed Computing | 2017
Jun Wang; Dan Huang; Huafeng Wu; Jiangling Yin; Xuhong Zhang; Xunchao Chen; Ruijun Wang
Recent years have seen an increasing number of hybrid scientific applications, which often consist of an HPC simulation program along with its corresponding data analytics programs. Unfortunately, current computing platform settings do not accommodate this emerging workflow very well, especially write-once-read-many workflows. This is mainly because HPC simulation programs store output data in a dedicated storage cluster equipped with a Parallel File System (PFS). To perform analytics on data generated by simulation, the data has to be migrated from the storage cluster to the compute cluster. This data migration can introduce severe delay, especially given ever-increasing data sizes. To solve the data migration problem in small- and medium-sized HPC clusters, we propose to construct a sided I/O path, named SideIO, to explicitly direct analysis data to a data-intensive file system (DIFS in brief) that co-locates computation with data. In contrast, checkpoint data, which may not be read back later, is written to the dedicated PFS to maximize I/O throughput. There are three components in SideIO: an I/O splitter separates simulation outputs between the two storage systems (PFS or DIFS); an I/O middleware component allows original HPC simulation programs to execute direct I/O operations over DIFS without any porting effort; and an I/O scheduler dynamically smooths out both disk write and read traffic for simulation and analysis programs alike. By experimenting with two real-world scientific workflows over a 46-node SideIO prototype, we found that SideIO achieves read/write I/O performance comparable to that of small- and medium-sized HPC clusters equipped with PFS. More importantly, since SideIO completely avoids the most expensive data movement overhead, it achieves up to 3x speedups for hybrid scientific workflow applications compared with current solutions.
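The I/O splitter's routing rule is described only at the level of "analysis data to DIFS, checkpoint data to PFS". The sketch below fills in a filename-based classification purely as an assumption to make the idea concrete; the mount points and naming convention are hypothetical.

```python
from pathlib import Path

# Hypothetical routing rule for an I/O splitter in the spirit of SideIO: the
# abstract only says analysis outputs go to DIFS and checkpoints go to PFS,
# so classifying by filename convention is an assumption.
DIFS_ROOT = Path("/difs")   # analysis data, co-located with compute
PFS_ROOT = Path("/pfs")     # checkpoint data, written once, rarely read back

def route_output(filename: str) -> Path:
    """Return the mount point an output file should be written to."""
    is_checkpoint = filename.endswith(".chk") or "checkpoint" in filename
    root = PFS_ROOT if is_checkpoint else DIFS_ROOT
    return root / filename

# Example: route_output("timestep_042.h5") -> /difs/timestep_042.h5
#          route_output("restart.chk")     -> /pfs/restart.chk
```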
Very Large Data Bases | 2016
Xuhong Zhang; Jun Wang; Jiangling Yin
In this paper, we aim to enable both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Due to the prohibitive storage overhead of caching offline samples for each sub-dataset, existing offline-sample-based systems provide high-accuracy results for only a limited number of sub-datasets, such as the popular ones. On the other hand, current online-sample-based approximation systems, which generate samples at runtime, do not take into account the uneven storage distribution of a sub-dataset. They work well for uniformly distributed sub-datasets but suffer from low sampling efficiency and poor estimation accuracy on unevenly distributed ones. To address the problem, we develop a distribution-aware method called Sapprox. Our idea is to collect the occurrences of a sub-dataset at each logical partition of a dataset (its storage distribution) in the distributed system, and to make good use of this information to facilitate online sampling. There are three thrusts in Sapprox. First, we develop a probabilistic map to reduce the exponential number of recorded sub-datasets to a linear one. Second, we apply cluster sampling with unequal probability theory to implement a distribution-aware sampling method for efficient online sub-dataset sampling. Third, we quantitatively derive the optimal sampling unit size in a distributed file system by associating it with approximation costs and accuracy. We have implemented Sapprox in the Hadoop ecosystem as an example system and open-sourced it on GitHub. Our comprehensive experimental results show that Sapprox can achieve a speedup of up to 20× over precise execution.
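Sapprox applies cluster sampling with unequal probabilities to the logical partitions of a dataset. The sketch below illustrates that idea with a Hansen-Hurwitz-style estimator; the sampling unit, the probability weights, and the estimator details are assumptions rather than the paper's exact derivation.

```python
import random

def sapprox_style_estimate(occurrences, value_of_segment, n_samples=50):
    """Illustrative cluster sampling with unequal probabilities.
    occurrences[seg]: how often the target sub-dataset appears in segment seg
                      (the kind of metadata Sapprox collects per logical partition).
    value_of_segment(seg): the aggregate obtained by actually reading segment seg.
    Segments are drawn with probability proportional to their occurrence counts,
    and the total is estimated with a Hansen-Hurwitz estimator."""
    segments = list(occurrences)
    total_occ = sum(occurrences.values())
    probs = {s: occurrences[s] / total_occ for s in segments}
    drawn = random.choices(segments, weights=[probs[s] for s in segments], k=n_samples)
    estimate = sum(value_of_segment(s) / probs[s] for s in drawn) / n_samples
    return estimate
```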