
Publication


Featured research published by Xuhao Chen.


International Symposium on Microarchitecture | 2014

Adaptive Cache Management for Energy-Efficient GPU Computing

Xuhao Chen; Li-Wen Chang; Christopher I. Rodrigues; Jie Lv; Zhiying Wang; Wen-mei W. Hwu

With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many applications that have regular memory access patterns. To support applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture temporal and spatial locality and mitigate the effect of irregular accesses. However, GPU caches exhibit poor efficiency due to the mismatch between the throughput-oriented execution model and the cache hierarchy design, which limits system performance and energy efficiency. The massive number of memory requests generated by GPUs causes cache contention and resource congestion. Existing CPU cache management policies that are designed for multicore systems can be suboptimal when directly applied to GPU caches. We propose a specialized cache management policy for GPGPUs. The cache hierarchy is protected from contention by a bypass policy based on reuse distance. Contention and resource congestion are detected at runtime. To avoid oversaturating on-chip resources, the bypass policy is coordinated with warp throttling to dynamically control the number of active warps. We also propose a simple predictor to dynamically estimate the optimal number of active warps that can take full advantage of the cache space and on-chip resources. Experimental results show that cache efficiency is significantly improved and on-chip resources are better utilized for cache-sensitive benchmarks. This results in a harmonic-mean IPC improvement of 74% and 17% (maximum 661% and 44% IPC improvement) over the baseline GPU architecture and optimal static warp throttling, respectively.
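The coordination of reuse-distance-based bypassing with warp throttling can be sketched as follows. This is not the paper's implementation; it is a hypothetical simplification in which a block is bypassed when its predicted reuse distance exceeds the cache capacity, and the active warp count is halved when congestion is detected (the constants and function names are illustrative only).

```python
CACHE_LINES = 4  # illustrative cache size, not a real GPU parameter

def should_bypass(predicted_reuse_distance, cache_lines=CACHE_LINES):
    """Bypass the cache when the block is unlikely to be reused
    before eviction (reuse distance >= number of cache lines)."""
    return predicted_reuse_distance >= cache_lines

def throttle(active_warps, congestion_detected, min_warps=1):
    """Halve the number of active warps when congestion is detected,
    never dropping below min_warps."""
    if congestion_detected:
        return max(min_warps, active_warps // 2)
    return active_warps
```

In this toy model, a request with reuse distance 8 is bypassed on a 4-line cache, and 8 active warps are throttled down to 4 when congestion is observed.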


International Workshop on Manycore Embedded Systems | 2014

Adaptive Cache Bypass and Insertion for Many-core Accelerators

Xuhao Chen; Shengzhao Wu; Li-Wen Chang; Wei Sheng Huang; Carl Pearson; Zhiying Wang; Wen-mei W. Hwu

Many-core accelerators, e.g. GPUs, are widely used to accelerate general-purpose compute kernels. With the SIMT execution model, GPUs can hide memory latency through massive multithreading for many regular applications. To support more applications with irregular memory access patterns, cache hierarchies have been introduced to GPU architectures to capture input data sharing and mitigate the effect of irregular accesses. However, GPU caches suffer from poor efficiency due to severe contention, which makes it difficult to adopt heuristic management policies and also limits system performance and energy efficiency. We propose an adaptive cache management policy specifically for many-core accelerators. The tag array of the L2 cache is enhanced with extra bits to track memory access history; the captured locality information is provided to the L1 cache as heuristics to guide its runtime bypass and insertion decisions. By preventing un-reused data from polluting the cache and alleviating contention, cache efficiency is significantly improved. As a result, system performance improves by 31% on average for cache-sensitive benchmarks, compared to the baseline GPU architecture.
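The idea of extra L2 tag bits guiding L1 decisions can be illustrated with a small sketch. This is a hypothetical stand-in, not the hardware design: a saturating per-tag counter plays the role of the extended tag bits, and the L1 policy bypasses never-reused data, inserting likely-reused data at MRU and the rest at LRU.

```python
reuse_bits = {}  # tag -> saturating 2-bit reuse counter (hypothetical)

def l2_access(tag):
    """Record an access in the (hypothetical) extended L2 tag array."""
    reuse_bits[tag] = min(3, reuse_bits.get(tag, 0) + 1)
    return reuse_bits[tag]

def l1_decision(tag):
    """Use the L2 hint: bypass never-seen data, insert likely-reused
    data at the MRU position, and everything else at LRU."""
    hint = reuse_bits.get(tag, 0)
    if hint == 0:
        return "bypass"
    return "insert_mru" if hint >= 2 else "insert_lru"
```

A block never seen by L2 is bypassed; after one recorded access it is inserted at LRU, and after repeated accesses at MRU.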


Design Automation Conference | 2016

Architecting energy-efficient STT-RAM based register file on GPGPUs via delta compression

Hang Zhang; Xuhao Chen; Nong Xiao; Fang Liu

To facilitate efficient context switches, GPUs usually employ a large-capacity register file to accommodate a massive amount of context information. However, the large register file introduces high power consumption, owing to the high leakage power of SRAM cells. Emerging non-volatile STT-RAM memory has recently been studied as a potential replacement to alleviate the leakage challenge when constructing register files on GPUs. Unfortunately, due to the long write latency and high energy consumption associated with write operations in STT-RAM, simply replacing SRAM with STT-RAM for register files would incur non-trivial performance overhead and bring only marginal energy benefits. In this paper, we propose to optimize STT-RAM based GPU register files for better energy efficiency and performance via two techniques. First, we employ a lightweight compression framework with awareness of register value similarity. It is coupled with a group-based write driver control to mitigate the high energy overhead caused by STT-RAM writes. Second, to address the long write latency of STT-RAM, we propose a centralized SRAM-based write buffer design to efficiently absorb STT-RAM writes with better buffer utilization, rather than the conventional design with distributed per-bank write buffers. The experimental results show that our STT-RAM based register file design consumes only 37.4% of the energy of the SRAM baseline, while incurring only negligible performance degradation.
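The register value similarity that the compression framework exploits can be sketched as a simple delta encoding. This is a hypothetical illustration, not the paper's circuit: a vector of register values is expressed as one base value plus narrow deltas, and compression fails when any delta does not fit in the chosen width.

```python
def delta_compress(values, delta_bits=8):
    """Try to express a group of register values as a base value plus
    small signed deltas; return None when any delta needs more than
    delta_bits bits (i.e. the values are not similar enough)."""
    base = values[0]
    deltas = [v - base for v in values]
    limit = 1 << (delta_bits - 1)       # signed range: [-limit, limit)
    if all(-limit <= d < limit for d in deltas):
        return base, deltas
    return None
```

Nearby values such as loop indices compress (only the narrow deltas need be written), while widely spread values fall back to uncompressed storage.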


Great Lakes Symposium on VLSI | 2016

Red-Shield: Shielding Read Disturbance for STT-RAM Based Register Files on GPUs

Hang Zhang; Xuhao Chen; Nong Xiao; Fang Liu; Zhiguang Chen

To address the high energy consumption of SRAM on GPUs, emerging Spin-Transfer Torque RAM (STT-RAM) memory technology has been intensively studied for building GPU register files with better energy efficiency, thanks to its low leakage power, high density, and good scalability. However, STT-RAM suffers from a reliability issue, read disturbance, which stems from the fact that the margin between read current and write current shrinks as technology scales. Read disturbance leads to high error rates for read operations, which cannot be effectively protected by SECDED ECC on the large-capacity register files of GPUs. Prior schemes (e.g. read-restore) to mitigate read disturbance usually incur either non-trivial performance loss or excessive energy overhead, and are thus not applicable to GPU register file designs that aim for both high performance and energy efficiency. To combat read disturbance on GPU register files, we propose a novel software-hardware co-designed solution, Red-Shield, which consists of three optimizations to overcome the limitations of existing solutions. First, we identify dead reads at compile time and augment instructions to avoid unnecessary restores. Second, we employ a small read buffer to accommodate register reads with high access locality, further reducing restores. Third, we propose an adaptive restore mechanism that selects the suitable restore scheme according to the busy status of the corresponding register banks. Experimental results show that our proposed design effectively mitigates the performance loss and energy overhead caused by restore operations, while still maintaining read reliability.
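The three Red-Shield decisions can be condensed into one selection function. This is a hedged sketch of the decision logic only, with invented names: dead reads (last consumers of a value, identified at compile time) skip the restore entirely, and otherwise the restore is deferred on a busy bank or performed immediately on an idle one.

```python
def choose_restore(bank_busy, read_is_dead):
    """Pick a restore action for an STT-RAM read (illustrative only):
    - dead reads need no restore, since the value is never read again;
    - on a busy bank, defer the restore to avoid stalling the port;
    - on an idle bank, restore immediately."""
    if read_is_dead:
        return "skip"
    return "deferred" if bank_busy else "immediate"
```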


International Symposium on Parallel Architectures, Algorithms and Programming | 2011

GSM: An Efficient Code Generation Algorithm for Dynamic Binary Translator

Xuhao Chen; Zhong Zheng; Li Shen; Wei Chen; Zhiying Wang

Dynamic binary translation is an effective way to address the binary compatibility problem. Embedded systems and other novel RISC ISAs are developing fast without consideration of binary compatibility with off-the-shelf x86 applications, making dynamic binary translators (DBTs) from CISC to RISC increasingly important. However, dynamic code generation remains inefficient due to code expansion. Conventional code generators in DBTs use a one-to-many mapping scheme between source and target code, which cannot take full advantage of the target ISA. We propose a novel lightweight code generation algorithm, GSM (Greedy Subgraph Mapping), which can generate compact code with low overhead using many-to-one mapping. GSM is implemented and evaluated in a DBT prototype system called TransARM. Experimental results demonstrate that GSM generates higher-quality target code than a conventional implementation, bringing the code expansion rate close to 1.3. Moreover, GSM incurs only slight extra translation overhead with negligible slowdown, and enables a 10% performance improvement for target code execution.
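The many-to-one idea behind GSM can be sketched as greedy longest-pattern matching. This is a hypothetical toy, not the GSM algorithm itself: the pattern table, the mnemonic names, and the two-instruction window are all invented for illustration; the point is that a sequence of source instructions collapses into a single target instruction when a known pattern matches.

```python
# Hypothetical pattern table: a source-instruction sequence that can be
# covered by one target instruction (mnemonics are invented examples).
PATTERNS = {
    ("cmp", "jz"):  "cbz",       # compare + branch-if-zero -> one op
    ("shl", "add"): "add_lsl",   # shift + add -> shifted-operand add
}

def greedy_map(source_ops):
    """Greedily match the longest known source pattern at each point,
    emitting one target instruction per match (many-to-one); unmatched
    instructions pass through unchanged (one-to-one fallback)."""
    out, i = [], 0
    while i < len(source_ops):
        seq = tuple(source_ops[i:i + 2])
        if seq in PATTERNS:
            out.append(PATTERNS[seq])
            i += 2
        else:
            out.append(source_ops[i])
            i += 1
    return out
```

For example, the five-instruction sequence shl, add, cmp, jz, mov maps to three target instructions, shrinking the expansion rate.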


International Conference on Transportation, Mechanical and Electrical Engineering | 2011

Performance model for OpenMP parallelized loops

Zhong Zheng; Xuhao Chen; Zhiying Wang; Li Shen; Jiawen Li

OpenMP is one of the most widely used parallel programming techniques in the modern multi-core era. Parallelizing a loop with OpenMP is as simple as adding a few directives. Because of this simplicity, however, it is not rare for programmers to overuse OpenMP to parallelize loops, introducing too much overhead and causing performance degradation. This paper establishes a performance model for OpenMP parallelized loops that captures the critical factors influencing performance. The model is validated through experiments on three different multi-core platforms. The results show that the best performance is obtained when the number of threads in an OpenMP application equals the number of cores available on the platform, and that parallelizing the outermost loop of a loop nest yields higher speedup.
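A toy model in the spirit of this result (the formula and constants below are illustrative assumptions, not the paper's model) treats useful parallelism as capped by the core count while per-thread fork/join overhead keeps growing with the thread count:

```python
def loop_time(serial_time, threads, cores, fork_join_overhead):
    """Toy cost model for an OpenMP parallel loop (hypothetical):
    work scales with min(threads, cores), while each thread adds a
    fixed fork/join overhead, so oversubscription only hurts."""
    effective = min(threads, cores)
    return serial_time / effective + threads * fork_join_overhead
```

With these assumptions, 4 threads on a 4-core machine beat 8 threads on the same machine, matching the paper's conclusion that threads should equal cores.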


International Conference on Parallel and Distributed Systems | 2011

Characterizing Fine-Grain Parallelism on Modern Multicore Platform

Xuhao Chen; Wei Chen; Jiawen Li; Zhong Zheng; Li Shen; Zhiying Wang

Since chip multiprocessors came to dominate the processor market, developing a parallel programming model with a proper trade-off between productivity and efficiency has become increasingly important. As a typical fine-grain parallelism model, Intel Threading Building Blocks (TBB) simplifies parallel programming through runtime scheduling. Despite its simplicity, it incurs non-trivial runtime overhead, which may grow as the thread count increases. In this work, we conduct experiments on real commodity hardware to evaluate the performance scalability of TBB using the PARSEC benchmark suite. We first compare TBB with Pthreads to show that TBB applications can achieve performance comparable to Pthreads applications. To find the performance bottlenecks of TBB applications, we measure the runtime overhead of TBB, focusing on three basic TBB runtime activities. The results provide valuable implications for developing scalable runtime libraries and architectural support that alleviate performance bottlenecks.


International Conference of Information Technology, Computer Engineering and Management Sciences | 2011

Evaluating Scalability of Emerging Multithreaded Applications on Commodity Multicore Server

Xuhao Chen; Jiawen Li; Zhong Zheng; Li Shen; Zhiying Wang

The performance of multithreaded applications is often limited by resources such as shared cache and memory bandwidth. Several prior studies have examined this issue, but most have been constrained by the use of simulators and out-of-date benchmarks. In this work, we conduct experiments on real commodity CMP machines, using a recently released CMP benchmark suite, PARSEC, to investigate the influence of cache sharing and memory bandwidth on the scalability of emerging parallel applications. The results reveal the behavioral characteristics of these benchmarks. We find that shared cache and memory bandwidth are indeed bottlenecks for some of these applications. This conclusion provides implications for hardware manufacturers and system software designers building scalable parallel systems.


International Parallel and Distributed Processing Symposium | 2017

Efficient and Portable ALS Matrix Factorization for Recommender Systems

Jing Chen; Jianbin Fang; Weifeng Liu; Tao Tang; Xuhao Chen; Canqun Yang

Alternating least squares (ALS) has proved to be an effective solver for matrix factorization in recommender systems. To speed up factorization, various parallel ALS solvers have been proposed to leverage modern multi-core CPUs and many-core GPUs/MICs. Existing implementations are limited in either speed or portability (constrained to certain platforms). In this paper, we present an efficient and portable ALS solver for recommender systems. On the one hand, we diagnose the baseline implementation and observe that it lacks awareness of the hierarchical thread organization of modern hardware. To achieve high performance, we apply a thread-batching technique and three architecture-specific optimizations. On the other hand, we implement the ALS solver in OpenCL so that it can run on various platforms (CPUs, GPUs, and MICs). Based on the architectural specifics, we select a suitable code variant for each platform to map it efficiently to the underlying hardware. The experimental results show that our implementation runs 5.5× faster on a 16-core CPU and 21.2× faster on a K20c GPU than the baseline implementation. Our implementation also outperforms cuMF on various datasets.
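The core of ALS is alternating between two closed-form least-squares updates. As a minimal sketch (rank-1, dense, no regularization; a drastic simplification of what any real recommender solver does), factor a matrix R as an outer product u·vᵀ by alternately solving for u with v fixed and for v with u fixed:

```python
def als_rank1(R, iters=20):
    """Minimal rank-1 ALS sketch: with v fixed, the least-squares
    update for each u[i] is a ratio of dot products, and symmetrically
    for v with u fixed. Alternating the two converges for rank-1 R."""
    m, n = len(R), len(R[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        for i in range(m):                      # solve u with v fixed
            u[i] = sum(v[j] * R[i][j] for j in range(n)) / sum(x * x for x in v)
        for j in range(n):                      # solve v with u fixed
            v[j] = sum(u[i] * R[i][j] for i in range(m)) / sum(x * x for x in u)
    return u, v
```

On an exactly rank-1 matrix such as [[2, 4], [3, 6]] the reconstruction u[i]·v[j] recovers every entry; real solvers add regularization, handle missing ratings, and solve a small linear system per row.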


Concurrency and Computation: Practice and Experience | 2017

Efficient and High-quality Sparse Graph Coloring on the GPU.

Xuhao Chen; Pingfan Li; Jianbin Fang; Tao Tang; Zhiying Wang; Canqun Yang

Graph coloring has been broadly used to discover concurrency in parallel computing. To speed up graph coloring for large-scale datasets, parallel algorithms have been proposed to leverage modern GPUs. Existing GPU implementations either have limited performance or yield unsatisfactory coloring quality (too many colors assigned). We present a work-efficient parallel graph coloring implementation on GPUs with good coloring quality. Our approach uses the speculative greedy scheme, which inherently yields better quality than methods based on finding maximal independent sets. To achieve high performance on GPUs, we refine the algorithm to leverage efficient operators and alleviate conflicts. We also incorporate common optimization techniques to further improve performance. Our method is evaluated with both synthetic and real-world sparse graphs on an NVIDIA GPU. Experimental results show that our implementation achieves an average 4.1× (up to 8.9×) speedup over the serial implementation. It also outperforms the existing GPU implementation from the NVIDIA CUSPARSE library (2.2× average speedup), while yielding much better coloring quality than CUSPARSE.
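The speculative greedy scheme can be sketched as a round-based loop. The following is a sequential Python stand-in for the GPU kernels (the tie-break rule and data layout are illustrative assumptions): every uncolored vertex speculatively picks the smallest color not used in a round-start snapshot, conflicting neighbors are detected, and only the losers are recolored in the next round.

```python
def speculative_greedy_color(adj):
    """Speculative greedy coloring sketch: color in rounds, detect
    conflicts, recolor only conflicting vertices. adj[v] lists the
    neighbours of vertex v; returns one color per vertex."""
    n = len(adj)
    color = [-1] * n
    worklist = list(range(n))
    while worklist:
        # Speculative phase: each vertex picks the smallest color not
        # used by its neighbours in the round-start snapshot.
        snapshot = color[:]
        for v in worklist:
            used = {snapshot[w] for w in adj[v]}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        # Conflict detection: if two neighbours picked the same color,
        # the higher-numbered vertex loses and is recolored next round.
        conflicts = [v for v in worklist
                     if any(color[w] == color[v] and w < v for w in adj[v])]
        for v in conflicts:
            color[v] = -1
        worklist = conflicts
    return color
```

Each round the lowest-numbered vertex of every conflicting pair keeps its color, so the worklist strictly shrinks and the loop terminates with a valid coloring; on a triangle it settles in three rounds.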

Collaboration


Dive into Xuhao Chen's collaborations.

Top Co-Authors

Canqun Yang (National University of Defense Technology)
Zhiying Wang (National University of Defense Technology)
Jianbin Fang (National University of Defense Technology)
Tao Tang (National University of Defense Technology)
Li Shen (National University of Defense Technology)
Zhong Zheng (National University of Defense Technology)
Cheng Chen (National University of Defense Technology)
Jiawen Li (National University of Defense Technology)
Wei Chen (National University of Defense Technology)
Fang Liu (National University of Defense Technology)