
Publication


Featured research published by Xunchao Chen.


Design Automation Conference | 2016

AOS: adaptive overwrite scheme for energy-efficient MLC STT-RAM cache

Xunchao Chen; Navid Khoshavi; Jian Zhou; Dan Huang; Ronald F. DeMara; Jun Wang; Wujie Wen; Yiran Chen

Spin-Transfer Torque Random Access Memory (STT-RAM) has been identified as an advantageous candidate for on-chip memory technology due to its high density and ultra-low leakage power. Recent progress in Magnetic Tunneling Junction (MTJ) devices has produced Multi-Level Cell (MLC) STT-RAM to further enhance cell density. To correct write disturbance in the MLC strategy, data stored in the soft bit must be restored immediately after the hard-bit switching completes. However, frequent restores are often unnecessary and introduce significant energy overhead. In this paper, we propose an Adaptive Overwrite Scheme (AOS) that alleviates restoration overhead by intentionally overwriting selected soft bit lines based on Read Reuse Distance (RRD). Our experimental results show a 54.6% reduction in soft bit restorations, delivering a 10.8% decrease in overall energy consumption. Moreover, AOS promotes MLC as a preferable L2 design alternative in terms of the energy, area, and latency product.
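
The core decision in AOS can be pictured with a short sketch. This is an illustrative reconstruction, not the authors' implementation: the RRD predictor, the threshold value, and the per-line bookkeeping below are all assumptions.

```python
# Illustrative AOS-style policy (assumed details, not the paper's design):
# skip the soft-bit restore when the line's observed read reuse distance
# (RRD) suggests it will not be read again soon.

RRD_THRESHOLD = 64  # hypothetical reuse-distance cutoff, in cache accesses

class SoftBitLine:
    def __init__(self):
        self.last_read_tick = None
        self.rrd_estimate = 0  # running estimate of read reuse distance

    def on_read(self, tick):
        if self.last_read_tick is not None:
            observed = tick - self.last_read_tick
            # exponential moving average keeps the predictor lightweight
            self.rrd_estimate = (self.rrd_estimate + observed) // 2
        self.last_read_tick = tick

def after_hard_bit_write(line, tick):
    """Decide whether to restore the disturbed soft bit or overwrite it."""
    if line.last_read_tick is None or line.rrd_estimate > RRD_THRESHOLD:
        return "overwrite"   # reads are rare: skip the costly restore
    return "restore"         # reads are imminent: restore immediately
```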


International Symposium on Quality Electronic Design | 2016

Bit-Upset Vulnerability Factor for eDRAM Last Level Cache immunity analysis

Navid Khoshavi; Xunchao Chen; Jun Wang; Ronald F. DeMara

Because contemporary Last Level Cache (LLC) designs occupy a significant fraction of the total die area in chip multiprocessors (CMPs), approaches are sought to deal with the LLC's increased vulnerability to Single Event Upsets (SEUs) and Multi-Bit Upsets (MBUs). In this paper, we focus on reliability assessment of an eDRAM LLC and propose a more accurate and application-relevant vulnerability estimation approach than conventional LLC SEU analysis methods. In particular, we propose the eDRAM Bit-Upset Vulnerability Factor (BUVF) and develop an algorithm to assess its behavior for soft errors using experimental benchmark suites. BUVF explicitly targets the vulnerable portion of the eDRAM refresh cycle, where the critical charge varies depending on write voltage, storage capacitance, and bit-line capacitance. Results for the PARSEC benchmark suite indicate that vulnerable sequences account for about 27.2% of the data array lifetime in the cache, of which the Read-Read (RR) access sequence contributes about 23.4%. Furthermore, regardless of the size of the vulnerable data set located in an RR sequence over a short interval, the corresponding region of the cache contributes negligible vulnerability to BUVF, because only a small fraction of program execution time is spent in RR sequences. We recast the problem of reliable eDRAM LLC design as a straightforward search for reduced BUVF.
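
As a rough illustration of how such a factor can be computed from an access trace, the sketch below scores a line's Read-Read windows against its lifetime. The trace format, and the simplification that only RR intervals count as vulnerable, are assumptions rather than the paper's full model.

```python
# Minimal sketch under assumptions: a BUVF-like metric as the fraction of
# a cache line's lifetime spent in "vulnerable" windows, here taken to be
# the interval between a read and the next read of the same line (the RR
# sequences the paper analyzes). The trace format is hypothetical.

def bit_upset_vulnerability_factor(trace, lifetime):
    """trace: time-sorted list of (time, op) with op in {"R", "W"};
    lifetime: total observation time for the line."""
    vulnerable = 0
    last_read = None
    for time, op in trace:
        if op == "R":
            if last_read is not None:
                # a Read-Read window: the value must survive unrefreshed
                vulnerable += time - last_read
            last_read = time
        else:  # a write re-establishes full charge, closing the window
            last_read = None
    return vulnerable / lifetime

# Example: reads at t=0, 5, 7 and a write at t=10, over a 20-cycle lifetime
print(bit_upset_vulnerability_factor(
    [(0, "R"), (5, "R"), (7, "R"), (10, "W")], 20))  # -> 0.35
```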


IEEE Transactions on Computers | 2017

Energy-Aware Adaptive Restore Schemes for MLC STT-RAM Cache

Xunchao Chen; Navid Khoshavi; Ronald F. DeMara; Jun Wang; Dan Huang; Wujie Wen; Yiran Chen

For the sake of higher cell density while achieving near-zero standby power, recent research progress in Magnetic Tunneling Junction (MTJ) devices has leveraged Multi-Level Cell (MLC) configurations of Spin-Transfer Torque Random Access Memory (STT-RAM). However, to mitigate write disturbance in an MLC strategy, data stored in the soft bit must be restored immediately after the hard-bit switching completes. Furthermore, as MTJ feature sizes scale down, the soft bit can be expected to be disturbed by the read sensing current as well, requiring an immediate restore operation to ensure data reliability. In this paper, we design and analyze novel Adaptive Restore Schemes for Write Disturbance (ARS-WD) and Read Disturbance (ARS-RD), respectively. ARS-WD alleviates restoration overhead by intentionally overwriting soft bit lines that are unlikely to be read. ARS-RD, on the other hand, aggregates the potential writes and restores the soft bit line at the time of its eviction from the higher-level cache. Both schemes are based on a lightweight forecasting approach for the future read behavior of the cache block. Our experimental results show a substantial reduction in soft bit line restore operations, delivering a 17.9 percent decrease in overall energy consumption and a 9.4 percent increase in IPC, while incurring negligible capacity overhead. Moreover, ARS promotes the advantages of MLC, providing a preferable L2 design alternative to SLC STT-RAM in terms of the energy, area, and latency product.
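
A minimal sketch of the forecasting idea follows, assuming a per-line saturating counter as the lightweight read predictor; the counter width, thresholds, and the `ars_wd`/`ars_rd` decision helpers are illustrative names, not the paper's interface.

```python
# Illustrative sketch only: a lightweight read-behavior predictor driving
# both ARS-WD (skip restores for lines unlikely to be read) and ARS-RD
# (defer the read-disturb restore until eviction). The 2-bit counter and
# the threshold are assumptions.

class ReadPredictor:
    """Per-line saturating 2-bit counter forecasting future reads."""
    def __init__(self):
        self.counter = 2  # start weakly predicting "will be read"

    def on_read(self):
        self.counter = min(3, self.counter + 1)

    def on_write(self):
        self.counter = max(0, self.counter - 1)

    def likely_read(self):
        return self.counter >= 2

def ars_wd(line_pred):
    # Write disturbance: restore only if a future read is predicted;
    # otherwise intentionally overwrite the soft bit line.
    return "restore" if line_pred.likely_read() else "overwrite"

def ars_rd(evicting):
    # Read disturbance: aggregate pending restores and perform the restore
    # lazily when the block is evicted from the higher-level cache.
    return "restore-now" if evicting else "defer"
```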


Journal of Parallel and Distributed Computing | 2017

SideIO: A Side I/O system framework for hybrid scientific workflow

Jun Wang; Dan Huang; Huafeng Wu; Jiangling Yin; Xuhong Zhang; Xunchao Chen; Ruijun Wang

Recent years have seen an increasing number of hybrid scientific applications, which often consist of one HPC simulation program along with its corresponding data analytics programs. Unfortunately, current computing platform settings do not accommodate this emerging workflow very well, especially write-once-read-many workflows. This is mainly because HPC simulation programs store output data in a dedicated storage cluster equipped with a Parallel File System (PFS). To perform analytics on data generated by a simulation, the data has to be migrated from the storage cluster to the compute cluster. This migration can introduce severe delays, especially given ever-increasing data sizes. To solve the data migration problem in small- to medium-sized HPC clusters, we propose to construct a side I/O path, named SideIO, that explicitly directs analysis data to a data-intensive file system (DIFS in brief) that co-locates computation with data. In contrast, checkpoint data, which may never be read back, is written to the dedicated PFS to maximize I/O throughput. SideIO has three components: an I/O splitter separates simulation outputs between storage systems (PFS or DIFS); an I/O middleware component allows original HPC simulation programs to execute direct I/O operations over DIFS without any porting effort; and an I/O scheduler dynamically smooths out both disk write and read traffic for simulation and analysis programs. By experimenting with two real-world scientific workflows on a 46-node SideIO prototype, we found that SideIO achieves read/write I/O performance comparable to small- to medium-sized HPC clusters equipped with a PFS. More importantly, since SideIO completely avoids the most expensive data movement overhead, it achieves up to 3x speedups for hybrid scientific workflow applications compared with current solutions.
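
The splitter component can be pictured as a simple routing decision. The sketch below is hypothetical: the mount points, the path-based classifier, and the `side_io_write` helper are illustrative, and the real middleware intercepts I/O without requiring such an explicit call.

```python
# A minimal sketch, assuming a SideIO-like splitter: checkpoint data is
# routed to the PFS for raw throughput, while analysis data goes to the
# data-intensive file system (DIFS) so later analytics run where the data
# lives. Mount points and the filename-based classifier are made up.

import os

PFS_ROOT = "/mnt/pfs"    # hypothetical parallel file system mount point
DIFS_ROOT = "/mnt/difs"  # hypothetical data-intensive file system mount

def classify(filename):
    """Checkpoints are write-mostly; everything else is analysis data."""
    return "checkpoint" if "ckpt" in filename else "analysis"

def side_io_write(filename, payload):
    # route by category: write-once-read-many analysis output lands in
    # DIFS, write-mostly checkpoints land in the dedicated PFS
    root = PFS_ROOT if classify(filename) == "checkpoint" else DIFS_ROOT
    path = os.path.join(root, filename)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(payload)
    return path
```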


International Conference on Distributed Computing Systems Workshops | 2016

Leveraging Semantic Links for High Efficiency Page-Level FTL Design

Jian Zhou; Xunchao Chen; Jun Wang; Fei Wu; You Zhou; Changsheng Xie

NAND Flash Solid State Disks (SSDs) are gaining tremendous popularity in today's storage market due to their low energy consumption and high I/O performance. To mask the unique erase-before-write feature of NAND flash, the Flash Translation Layer (FTL) in an SSD redirects incoming writes to a free physical address and manages a logical-to-physical address mapping table. However, the increasing capacity of SSDs has led to mapping tables large in size, which not only imposes high pressure on the efficiency of page-level address mapping, but also induces significant performance degradation in the SSD. To overcome this problem, Correlation-Aware Page-level FTL (CPFTL) is proposed in this work. CPFTL uniquely leverages the inherent data semantics in enterprise workloads to optimize mapping table cache management. First, a correlation-aware mapping table is developed based on the correlation in read operations. We then propose a correlation prediction table to support fast mapping-entry lookup in the correlation-aware mapping table. Our experimental results show that CPFTL reduces the average response time by 63.4% for read-dominant workloads and 32.9% for transaction workloads.
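
The caching idea can be sketched as follows, assuming a complete in-memory copy of the on-flash mapping table, a correlation prediction table given as a plain dictionary, and FIFO-style eviction; these structures and parameters are illustrative stand-ins, not CPFTL's actual design choices.

```python
# Sketch under assumptions: a CPFTL-style cached mapping table that, on a
# miss, fetches not just the requested logical-to-physical entry but also
# the entries its correlation prediction table links to it, so that
# semantically related reads hit in the mapping cache.

from collections import OrderedDict

class CorrelationAwareFTL:
    """Cached page-level mapping table with correlation-based prefetch."""
    def __init__(self, full_map, correlations, cache_size=1024):
        self.full_map = full_map          # complete LPN -> PPN table
        self.correlations = correlations  # LPN -> correlated LPNs (CPT)
        self.cache = OrderedDict()        # cached mapping entries
        self.cache_size = cache_size

    def translate(self, lpn):
        if lpn not in self.cache:                 # mapping-cache miss
            # fetch correlated entries first, then the demand entry, so
            # the demand entry is newest and cannot be self-evicted below
            for peer in self.correlations.get(lpn, []):
                if peer in self.full_map and peer not in self.cache:
                    self.cache[peer] = self.full_map[peer]
            self.cache[lpn] = self.full_map[lpn]
            while len(self.cache) > self.cache_size:
                self.cache.popitem(last=False)    # evict oldest entry
        self.cache.move_to_end(lpn)               # LRU-style promotion
        return self.cache[lpn]
```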


Petascale Data Storage Workshop | 2015

Experiences in using os-level virtualization for block I/O

Dan Huang; Jun Wang; Qing Liu; Jiangling Yin; Xuhong Zhang; Xunchao Chen

Today, HPC clusters commonly use resource management systems such as PBS and TORQUE to share physical resources. These systems enable resources to be shared by assigning nodes to users exclusively in non-overlapping time slots. With virtualization technology, users can run their applications on the same node with low mutual interference. However, the overhead introduced by a virtual machine monitor or hypervisor is too high to be acceptable, because efficiency is key to many HPC applications. OS-level virtualization (such as Linux Containers) offers a lightweight virtualization layer, which promises near-native performance and has been adopted by some BigData resource sharing platforms such as Mesos. Nevertheless, OS-level virtualization's overhead and isolation on block devices have not been completely evaluated, especially when applied to a shared distributed/parallel file system (D/PFS) such as HDFS or Lustre. In this paper, we thoroughly evaluate the overhead and isolation involved in sharing block I/O via OS-level virtualization on local disks and D/PFSs. Meanwhile, to assign D/PFS storage resources to users, a middleware system is proposed and implemented to bridge the configuration gap between virtual clusters and remote D/PFSs.
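
For a concrete flavor of the mechanism being evaluated, the snippet below installs a cgroup-v1 block-I/O throttle of the kind Linux Containers build on. The cgroup name, device numbers, and rate are made-up example values; the snippet assumes a cgroup-v1 blkio hierarchy and root privileges.

```python
# Illustrative only: per-group block-I/O throttling via the Linux
# cgroup-v1 blkio controller, the same kernel mechanism OS-level
# virtualization uses to isolate block devices. All values are examples.

import os

CGROUP = "/sys/fs/cgroup/blkio/hpc_user_a"  # hypothetical group name
DEVICE = "8:0"                  # example block device major:minor (sda)
READ_BPS = 50 * 1024 * 1024     # cap reads at 50 MB/s

os.makedirs(CGROUP, exist_ok=True)

# rule format is "<major>:<minor> <bytes_per_second>"
with open(os.path.join(CGROUP, "blkio.throttle.read_bps_device"), "w") as f:
    f.write(f"{DEVICE} {READ_BPS}")

# move the current process into the throttled group
with open(os.path.join(CGROUP, "cgroup.procs"), "w") as f:
    f.write(str(os.getpid()))
```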


Networking, Architecture and Storage | 2015

Achieving up to zero communication delay in BSP-based graph processing via vertex categorization

Xuhong Zhang; Ruijun Wang; Xunchao Chen; Jun Wang; Tyler Lukasiewicz; Dezhi Han

The Bulk Synchronous Parallel (BSP) model, which divides a graph algorithm into multiple supersteps, has become extremely popular in distributed graph processing systems. However, the large number of network messages exchanged in each superstep creates a long idle period, which we refer to as the communication delay. Furthermore, BSP's global synchronization barrier does not allow computation in the next superstep to be scheduled during this communication delay. As a result, the communication delay makes up a large percentage of the overall processing time of a superstep. While most recent research has focused on reducing the number of network messages, communication delay remains a decisive factor in overall performance. In this paper, we add a runtime communication and computation scheduler to current BSP graph-processing implementations. This scheduler moves some computation from the next superstep into the communication phase of the current superstep to mitigate the communication delay. Finally, we prototyped our system, Zebra, on Apache Hama, an open-source clone of the classic Google Pregel. By running a set of graph algorithms on an in-house cluster, our evaluation shows that Zebra can completely eliminate the communication delay in the best case and achieves an average 2X speedup over Hama.
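
The vertex-categorization idea can be sketched as below. Note this is a simplified variant: Zebra moves next-superstep computation into the current communication phase, whereas the sketch overlaps local-vertex computation with message transfer inside one superstep; the partition map and the compute/send hooks are hypothetical.

```python
# A minimal sketch, assuming Zebra-style vertex categorization: boundary
# vertices (those with remote neighbors) are computed first so their
# messages can start traveling early, and purely local vertices are
# computed while that communication is in flight, hiding the delay
# behind useful work.

import threading

def categorize(vertices, partition_of, my_partition, neighbors):
    boundary, local = [], []
    for v in vertices:
        if any(partition_of[n] != my_partition for n in neighbors[v]):
            boundary.append(v)   # has at least one remote neighbor
        else:
            local.append(v)      # all neighbors live on this partition
    return boundary, local

def run_superstep(vertices, partition_of, my_partition, neighbors,
                  compute, send_messages):
    boundary, local = categorize(vertices, partition_of,
                                 my_partition, neighbors)
    for v in boundary:           # boundary first: their messages go remote
        compute(v)
    sender = threading.Thread(target=send_messages, args=(boundary,))
    sender.start()               # overlap: network transfer starts now
    for v in local:              # local vertices computed during transfer
        compute(v)
    sender.join()                # barrier: superstep ends when both finish
```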


Symposium on Cloud Computing | 2017

DFS-container: achieving containerized block I/O for distributed file systems

Dan Huang; Jun Wang; Qing Liu; Xuhong Zhang; Xunchao Chen; Jian Zhou

Today, BigData systems commonly use resource management systems such as TORQUE, Mesos, and Google Borg to share physical resources among users or applications. Enabled by virtualization, users can run their applications on the same node with low mutual interference. Container-based virtualization (e.g., Docker and Linux Containers) offers a lightweight virtualization layer, which promises near-native performance and has been adopted by some BigData resource sharing platforms such as Mesos. Nevertheless, using containers to consolidate the I/O resources of shared storage systems is still at an early stage, especially for a distributed file system (DFS) such as the Hadoop File System (HDFS). To address this gap, we propose a distributed middleware system, DFS-Container, which further containerizes the DFS. We also evaluate and analyze the unfairness of using containers to proportionally allocate the I/O resources of a DFS. Based on these analyses and evaluations, we propose and implement a new mechanism, IOPS-Regulator, which improves the fairness of proportional allocation by 74.4% on average.
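
A proportional-share regulator of this general kind can be sketched as a weighted token bucket. The weights, refill interval, and class layout below are assumptions for illustration, not IOPS-Regulator's actual mechanism.

```python
# Sketch only: a proportional-share I/O admission controller. Each
# container accrues tokens in proportion to its configured weight, and an
# I/O request is admitted only when a token is available.

class IOPSRegulator:
    def __init__(self, weights, total_iops):
        self.weights = weights              # container -> share weight
        self.total_iops = total_iops        # cluster-wide IOPS budget
        self.tokens = {c: 0.0 for c in weights}

    def refill(self, elapsed_s):
        """Distribute the budget for `elapsed_s` seconds by weight."""
        total_w = sum(self.weights.values())
        for c, w in self.weights.items():
            self.tokens[c] += self.total_iops * elapsed_s * (w / total_w)

    def admit(self, container):
        """Consume a token and admit the I/O, or reject for now."""
        if self.tokens[container] >= 1.0:
            self.tokens[container] -= 1.0
            return True
        return False

# Example: containers A and B share 1000 IOPS at a 3:1 ratio
reg = IOPSRegulator({"A": 3, "B": 1}, total_iops=1000)
reg.refill(1.0)                  # one second elapses
print(reg.tokens)                # {'A': 750.0, 'B': 250.0}
```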


arXiv: Hardware Architecture | 2016

Read-Tuned STT-RAM and eDRAM Cache Hierarchies for Throughput and Energy Enhancement

Navid Khoshavi; Xunchao Chen; Jun Wang; Ronald F. DeMara


IEEE Transactions on Computers | 2018

Achieving Load Balance for Parallel Data Access on Distributed File Systems

Dan Huang; Dezhi Han; Jun Wang; Jiangling Yin; Xunchao Chen; Xuhong Zhang; Jian Zhou; Mao Ye

Collaboration


Dive into Xunchao Chen's collaborations.

Top Co-Authors

Jun Wang (University of Central Florida)
Dan Huang (University of Central Florida)
Jian Zhou (University of Central Florida)
Xuhong Zhang (University of Central Florida)
Navid Khoshavi (University of Central Florida)
Ronald F. DeMara (University of Central Florida)
Jiangling Yin (University of Central Florida)
Qing Liu (Oak Ridge National Laboratory)
Ruijun Wang (University of Central Florida)
Wujie Wen (Florida International University)