Is this you? Create Your Porfile

Xiaocheng Zou

North Carolina State University

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Xiaocheng Zou is active.

Explore More

Publication

Featured researches published by Xiaocheng Zou.

ieee/acm international symposium cluster, cloud and grid computing | 2013

PARLO: PArallel Run-Time Layout Optimization for Scientific Data Explorations with Heterogeneous Access Patterns

Zhenhuan Gong; David A. Boyuka; Xiaocheng Zou; Qing Liu; Norbert Podhorszki; Scott Klasky; Xiaosong Ma; Nagiza F. Samatova

The size and scope of cutting-edge scientific simulations are growing much faster than the I/O and storage capabilities of their run-time environments. The growing gap is exacerbated by exploratory, data-intensive analytics, such as querying simulation data with multivariate, spatio-temporal constraints, which induces heterogeneous access patterns that stress the performance of the underlying storage system. Previous work addresses data layout and indexing techniques to improve query performance for a single access pattern, which is not sufficient for complex analytics jobs. We present PARLO a parallel run-time layout optimization framework, to achieve multi-level data layout optimization for scientific applications at run-time before data is written to storage. The layout schemes optimize for heterogeneous access patterns with user-specified priorities. PARLO is integrated with ADIOS, a high-performance parallel I/O middleware for large-scale HPC applications, to achieve user-transparent, light-weight layout optimization for scientific datasets. It offers simple XML-based configuration for users to achieve flexible layout optimization without the need to modify or recompile application codes. Experiments show that PARLO improves performance by 2 to 26 times for queries with heterogeneous access patterns compared to state-of-the-art scientific database management systems. Compared to traditional post-processing approaches, its underlying run-time layout optimization achieves a 56% savings in processing time and a reduction in storage overhead of up to 50%. PARLO also exhibits a low run-time resource requirement, while also limiting the performance impact on running applications to a reasonable level.

international conference on supercomputing | 2014

RADAR: Runtime Asymmetric Data-Access Driven Scientific Data Replication

John Jenkins; Xiaocheng Zou; Houjun Tang; Dries Kimpe; Robert B. Ross; Nagiza F. Samatova

Efficient I/O on large-scale spatiotemporal scientific data requires scrutiny of both the logical layout of the data e.g., row-major vs. column-major and the physical layout e.g., distribution on parallel filesystems. For increasingly complex datasets, hand optimization is a difficult matter prone to error and not scalable to the increasing heterogeneity of analysis workloads. Given these factors, we present a partial data replication system called RADAR. We capture datatype- and collective-aware I/O access patterns indicating logical access via MPI-IO tracing and use a combination of coarse-grained and fine-grained performance modeling to evaluate and select optimized physical data distributions for the task at hand. Unlike conventional methods, we store all replica data and metadata, along with the original untouched data, under a single file container using the object abstraction in parallel filesystems. Our system results in manyfold improvements in some commonly used subvolume decomposition access patterns.Moreover, the modeling approach can determine whether such optimizations should be undertaken in the first place.

european conference on parallel processing | 2014

Improving Read Performance with Online Access Pattern Analysis and Prefetching

Houjun Tang; Xiaocheng Zou; John Jenkins; David A. Boyuka; Stephen Ranshous; Dries Kimpe; Scott Klasky; Nagiza F. Samatova

Among the major challenges of transitioning to exascale in HPC is the ubiquitous I/O bottleneck. For analysis and visualization applications in particular, this bottleneck is exacerbated by the write-onceread- many property of most scientific datasets combined with typically complex access patterns. One promising way to alleviate this problem is to recognize the application’s access patterns and utilize them to prefetch data, thereby overlapping computation and I/O. However, current research methods for analyzing access patterns are either offline-only and/or lack the support for complex access patterns, such as high-dimensional strided or composition-based unstructured access patterns. Therefore, we propose an online analyzer capable of detecting both simple and complex access patterns with low computational and memory overhead and high accuracy. By combining our pattern detection with prefetching,we consistently observe run-time reductions, up to 26%, across 18 configurations of PIOBench and 4 configurations of a micro-benchmark with both structured and unstructured access patterns.

ieee/acm international symposium cluster, cloud and grid computing | 2015

Parallel In Situ Detection of Connected Components in Adaptive Mesh Refinement Data

Xiaocheng Zou; Kesheng Wu; David A. Boyuka; Daniel F. Martin; Surendra Byna; Houjun Tang; Kushal Bansal; Terry J. Ligocki; Hans Johansen; Nagiza F. Samatova

Adaptive Mesh Refinement (AMR) represents a significant advance for scientific simulation codes, greatly reducing memory and compute requirements by dynamically varying simulation resolution over space and time. As simulation codes transition to AMR, existing analysis algorithms must also make this transition. One such algorithm, connected component detection, is of vital importance in many simulation and analysis contexts, with some simulation codes even relying on parallel, in situ connected component detection for correctness. Yet, current detection algorithms designed for uniform meshes are not applicable to hierarchical, non-uniform AMR, and to the best of our knowledge, AMR connected component detection has not been explored in the literature. Therefore, in this paper, we formally define the general problem of connected component detection for AMR, and present a general solution. Beyond solving the general detection problem, achieving viable in situ detection performance is even more challenging. The core issue is the conflict between the communication-intensive nature of connected component detection (in general, and especially for AMR data) and the requirement that in situ processes incur minimal performance impact on the co-located simulation. We address this challenge by presenting the first connected component detection methodology for structured AMR that is applicable in a parallel, in situ context. Our key strategy is the incorporation of an multi-phase AMR-aware communication pattern that synchronizes connectivity information across the AMR hierarchy. In addition, we distil our methodology to a generic framework within the Combo AMR infrastructure, making connected component detection services available for many existing applications. We demonstrate our methods efficacy by showing its ability to detect ice calving events in real time within the real-world BISICLES ice sheet modelling code. Results show up to a 6.8x speedup of our algorithm over the existing specialized BISICLES algorithm. We also show scalability results for our method up to 4,096 cores using a parallel Combo-based benchmark.

european conference on parallel processing | 2014

Fast Set Intersection through Run-Time Bitmap Construction over PForDelta-Compressed Indexes

Xiaocheng Zou; Sriram Lakshminarasimhan; David A. Boyuka; Stephen Ranshous; Houjun Tang; Scott Klasky; Nagiza F. Samatova

Set intersection is a fundamental operation for evaluating conjunctive queries in the context of scientific data analysis. The state-of-the-art approach in performing set intersection, compressed bitmap indexing, achieves high computational efficiency because of cheap bitwise operations; however, overall efficiency is often nullified by the HPC I/O bottleneck, because compressed bitmap indexes typically exhibit a heavy storage footprint. Conversely, the recently-presented PForDelta-compressed index has been demonstrated to be storage-lightweight, but has limited performance for set intersection. Thus, a more effective set intersection approach should be efficient in both computation and I/O.

international conference on cluster computing | 2015

Exploring Memory Hierarchy to Improve Scientific Data Read Performance

Wenzhao Zhang; Houjun Tang; Xiaocheng Zou; Steven Harenberg; Qing Liu; Scott Klasky; Nagiza F. Samatova

Improving read performance is one of the major challenges with speeding up scientific data analytic applications. Utilizing the memory hierarchy is one major line of researches to address the read performance bottleneck. Related methods usually combine solide-state-drives(SSDs) with dynamic random-access memory(DRAM) and/or parallel file system(PFS) to mitigate the speed and space gap between DRAM and PFS. However, these methods are unable to handle key performance issues plaguing SSDs, namely read contention that may cause up to 50% performance reduction. In this paper, we propose a framework that exploits the memory hierarchy resource to address the read contention issues involved with SSDs. The framework employs a general purpose online read algorithm that able to detect and utilize memory hierarchy resource to relieve the problem. To maintain a near optimal operating environment for SSDs, the framework is able to orchastrate data chunks across different memory layers to facilitate the read algorithm. Compared to existing tools, our framework achieves up to 50% read performance improvement when tested on datasets from real-world scientific simulations.

cluster computing and the grid | 2016

AMRZone: A Runtime AMR Data Sharing Framework for Scientific Applications

Wenzhao Zhang; Houjun Tang; Steve Harenberg; Surendra Byna; Xiaocheng Zou; Dharshi Devendran; Daniel F. Martin; Kesheng Wu; Bin Dong; Scott Klasky; Nagiza F. Samatova

Frameworks that facilitate runtime data sharingacross multiple applications are of great importance for scientificdata analytics. Although existing frameworks work well overuniform mesh data, they can not effectively handle adaptive meshrefinement (AMR) data. Among the challenges to construct anAMR-capable framework include: (1) designing an architecturethat facilitates online AMR data management, (2) achievinga load-balanced AMR data distribution for the data stagingspace at runtime, and (3) building an effective online indexto support the unique spatial data retrieval requirements forAMR data. Towards addressing these challenges to supportruntime AMR data sharing across scientific applications, wepresent the AMRZone framework. Experiments over real-worldAMR datasets demonstrate AMRZones effectiveness at achievinga balanced workload distribution, reading/writing large-scaledatasets with thousands of parallel processes, and satisfyingqueries with spatial constraints. Moreover, AMRZones performance and scalability are even comparable with existing state-of-the-art work when tested over uniform mesh data with up to16384 cores, in the best case, our framework achieves a 46% performance improvement.

cluster computing and the grid | 2016

Usage Pattern-Driven Dynamic Data Layout Reorganization

Houjun Tang; Surendra Byna; Steve Harenberg; Xiaocheng Zou; Wenzhao Zhang; Kesheng Wu; Bin Dong; Oliver Rübel; Kristofer E. Bouchard; Scott Klasky; Nagiza F. Samatova

As scientific simulations and experiments move toward extremely large scales and generate massive amounts of data, the data access performance of analytic applications becomes crucial. A mismatch often happens between write and read patterns of data accesses, typically resulting in poor read performance. Data layout reorganization has been used to improve the locality of data accesses. However, current data reorganizations are static and focus on generating a single (or set of) optimized layouts that rely on prior knowledge of exact future access patterns. We propose a framework that dynamically recognizes the data usage patterns, replicates the data of interest in multiple reorganized layouts that would benefit common read patterns, and makes runtime decisions on selecting a favorable layout for a given read pattern. This framework supports reading individual elements and chunks of a multi-dimensional array of variables. Our pattern-driven layout selection strategy achieves multi-fold speedups compared to reading from the original dataset.

international conference on parallel processing | 2016

In Situ Storage Layout Optimization for AMR Spatio-temporal Read Accesses

Houjun Tang; Suren Byna; Steve Harenberg; Wenzhao Zhang; Xiaocheng Zou; Daniel F. Martin; Bin Dong; Dharshi Devendran; Kesheng Wu; David Trebotich; Scott Klasky; Nagiza F. Samatova

Analyses of large simulation data often concentrate on regions in space and in time that contain important information. As simulations adopt Adaptive Mesh Refinement (AMR), the data records from a region of interest could be widely scattered on storage devices and accessing interesting regions results in significantly reduced I/O performance. In this work, we study the organization of block-structured AMR data on storage to improve performance of spatio-temporal data accesses. AMR has a complex hierarchical multi-resolution data structure that does not fit easily with the existing approaches that focus on uniform mesh data. To enable efficient AMR read accesses, we develop an in situ data layout optimization framework. Our framework automatically selects from a set of candidate layouts based on a performance model, and reorganizes the data before writing to storage. We evaluate this framework with three AMR datasets and access patterns derived from scientific applications. Our performance model is able to identify the best layout scheme and yields up to a 3X read performance improvement compared to the original layout. Though it is not possible to turn all read accesses into contiguous reads, we are able to achieve 90% of contiguous read throughput with the optimized layouts on average.

high performance distributed computing | 2013