Publication


Featured research published by Yuhang Liu.


Journal of Computer Science and Technology | 2015

Reevaluating Data Stall Time with the Consideration of Data Access Concurrency

Yuhang Liu; Xian-He Sun

Data access delay has become the prominent performance bottleneck of high-end computing systems. The key to reducing data access delay in system design is to diminish data stall time. Memory locality and concurrency are the two essential factors influencing the performance of modern memory systems. However, existing studies on reducing data stall time rarely focus on utilizing data access concurrency, because the impact of memory concurrency on overall memory system performance is not well understood. In this study, a pair of novel data stall time models is presented: the L-C model for the combined effect of locality and concurrency, and the P-M model for the effect of pure misses on data stall time. The models offer a new understanding of data access delay and point to new directions for performance optimization. Based on these models, a summary table of advanced cache optimizations is presented: 38 of its entries are contributed by data concurrency, while only 21 are contributed by data locality, which shows the value of data concurrency. The L-C and P-M models, together with their associated results and opportunities, are important for future data-centric architecture and algorithm design of modern computing systems.
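The abstract does not reproduce the L-C and P-M formulas, but the shift in viewpoint can be sketched numerically. The following is a minimal illustration, not the paper's actual models; all parameter names and values (the miss rates and penalties) are hypothetical:

```python
# Contrast a locality-only stall-time estimate with a concurrency-aware one.
# All parameter names and values below are illustrative, not from the paper.

def stall_locality_only(accesses, miss_rate, miss_penalty):
    """Classic locality-only view: every miss stalls for its full penalty."""
    return accesses * miss_rate * miss_penalty

def stall_concurrency_aware(accesses, pure_miss_rate, pure_miss_penalty):
    """Concurrency-aware view: only pure misses -- miss cycles overlapped
    with no hit or other miss -- contribute to stall time."""
    return accesses * pure_miss_rate * pure_miss_penalty

A = 1_000_000
print(stall_locality_only(A, miss_rate=0.05, miss_penalty=200))                 # 10,000,000 cycles
print(stall_concurrency_aware(A, pure_miss_rate=0.02, pure_miss_penalty=120))  # 2,400,000 cycles
```

Because many miss cycles are hidden behind hits and other misses, the pure miss rate and penalty are smaller than their plain counterparts, and the concurrency-aware estimate of stall time drops accordingly.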


International Conference on Parallel Processing | 2015

LPM: Concurrency-Driven Layered Performance Matching

Yuhang Liu; Xian-He Sun

Data access has become the preeminent performance bottleneck of computing. In this study, a Layered Performance Matching (LPM) model and its associated algorithm are proposed to match the request and reply speeds of each layer of a memory hierarchy in order to improve memory performance. The rationale of LPM is that the performance of each layer of a memory hierarchy should, and can, be optimized to closely match the requests of the layer directly above it. The LPM model considers data access concurrency and locality simultaneously. It reveals that increasing the effective overlap between hits and misses at a higher layer alleviates the performance impact of the lower layer. The terms pure miss and pure miss penalty are introduced to measure the effectiveness of such hit-miss overlapping. By distinguishing between (general) misses and pure misses, we make LPM optimization practical and feasible. Our evaluation shows that data stall time can be reduced significantly with an optimized hardware configuration. We also achieve noticeable performance improvement by simply adopting smart LPM scheduling without changing the underlying hardware configuration. Analysis and experimental results show that LPM is feasible and effective. It provides a novel and efficient way to cope with the ever-widening memory-wall problem and to optimize memory system design.
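A rough sketch of the matching idea, assuming a made-up per-layer miss ratio and hypothetical service rates (this is not the paper's algorithm, only an illustration of where requests pile up in a hierarchy):

```python
# Diagnose which layer of a memory hierarchy is mismatched with the demand
# from above. Rates and the 0.1 miss ratio are placeholder values.

def matching_ratios(request_rate, supply_rates):
    """Ratio of demanded to supplied request rate at each layer; values
    well above 1 mark the layers where requests queue up."""
    ratios = []
    for supply in supply_rates:
        ratios.append(request_rate / supply)
        request_rate *= 0.1  # only this layer's misses flow downward
    return ratios

# Requests per cycle from the core, and service rates for L1, L2, DRAM.
print(matching_ratios(request_rate=4.0, supply_rates=[2.0, 0.5, 0.05]))
```

In this toy run the first layer is the mismatch point, so LPM-style tuning would add concurrency or capacity there before touching the layers below.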


Cluster Computing | 2013

Asymmetrical topology and entropy-based heterogeneous link for many-core massive data communication

Yuhang Liu; Mingfa Zhu; Limin Xiao; Jue Wang

As the need for data processing and communication increases, and as the number of processing cores placed on a single chip grows, improving the performance of interconnection networks is vital. In the present work, traditional topologies are re-examined. The torus is shown to be a good structure in terms of average latency and symmetry. By combining the torus with advanced process technology, it is possible to design new, asymmetrical topologies that meet the high communication-performance requirements of many-core processors and suit a large variety of traffic patterns. First, this paper presents two novel torus-like topologies, called xtorus and xxtorus, which are evaluated using both theoretical analysis and experimental simulation. For the theoretical analysis, an algorithm for computing link path diversity and link entropy is given. The analysis shows that, compared with mesh, xmesh, and torus, the proposed topologies have better properties in terms of diameter, average latency, throughput, and path diversity. Although links are added, the number of links remains of the same order of magnitude as that of mesh, xmesh, and torus. The proposed topologies also take advantage of increasingly advanced VLSI processes. Simulations on GEM5 reveal that xtorus has better scalability and that its average latency is significantly lower than that of mesh, xmesh, and torus, particularly at larger network scales. Moreover, across different traffic patterns, its performance swing is smaller than that of mesh. Furthermore, since the proposed topologies are asymmetrical, a strategy for heterogeneous link design based on the entropy difference among the links in a topology is presented, which enables designers to trade off delay, power, and area according to a concrete integrated-circuit design scenario.
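One plausible way to compute a link-usage entropy for a small torus is sketched below, in the spirit of the paper's link-entropy analysis; the paper's exact definition may differ, and the 4x4 size is arbitrary:

```python
# Count how often each link lies on all-pairs BFS shortest paths of a 2D
# torus, then compute the Shannon entropy of that usage distribution.
import math
from collections import deque, Counter

N = 4  # 4x4 torus, illustrative size

def neighbors(node):
    x, y = node
    return [((x + 1) % N, y), ((x - 1) % N, y),
            (x, (y + 1) % N), (x, (y - 1) % N)]

def shortest_path(src, dst):
    """BFS; returns one shortest path as a list of nodes."""
    parent = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for v in neighbors(u):
            if v not in parent:
                parent[v] = u
                q.append(v)
    path, u = [], dst
    while u is not None:
        path.append(u)
        u = parent[u]
    return path[::-1]

usage = Counter()
nodes = [(x, y) for x in range(N) for y in range(N)]
for s in nodes:
    for d in nodes:
        if s != d:
            p = shortest_path(s, d)
            for a, b in zip(p, p[1:]):
                usage[frozenset((a, b))] += 1  # undirected link

total = sum(usage.values())
entropy = -sum((c / total) * math.log2(c / total) for c in usage.values())
print(f"links: {len(usage)}, link-usage entropy: {entropy:.3f} bits")
```

For a symmetric torus the usage is uniform and the entropy is maximal; an asymmetrical topology such as xtorus would show unequal link usage, which is exactly the difference the heterogeneous-link strategy exploits.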


IEEE Transactions on Big Data | 2018

CaL: Extending Data Locality to Consider Concurrency for Performance Optimization

Yuhang Liu; Xian-He Sun

Big data applications demand better memory performance. Data locality has long been the focus of reducing data access delay. In recent years, however, data access concurrency has become prevalent in modern memory systems. How to extend existing locality-based performance optimization to account for data concurrency is a timely issue facing researchers and practitioners in computing, especially in big data computing. In this study, we introduce the concept and definition of Concurrency-aware data access Locality (CaL), which, as its name states, extends the concept of locality by considering concurrency. Compared with the conventional concept of locality, CaL accurately reflects the combined impact of data access locality and concurrency in modern memory systems and is very effective for data-intensive applications. The value of CaL can be measured quantitatively and directly with the performance counters of mainstream commercial processors, making it practically feasible. Two theoretical results are presented that reveal the relationships between CaL and the existing memory system performance metrics of memory accesses per cycle (APC), average memory access time (AMAT), and memory bandwidth (B). In this way, we provide a methodology for using existing locality-based optimization methods, directly or in combination with data concurrency optimizations, to improve the value of CaL and hence the performance of a memory system. To demonstrate the practical value of CaL, we conduct four case studies that illustrate the power of concurrency-aware locality optimization. Compared with conventional locality-based optimization, the CaL-aware design achieves significant performance improvement, including a 3.12-fold speedup on K-means, a widely used data-analytics kernel from big data benchmarks.
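The paper's CaL theorems are not reproduced in the abstract, but the two standard metrics it relates CaL to are easy to state. A small sketch computing them from hypothetical counter readings (the counter values are placeholders):

```python
# APC and AMAT from (hypothetical) hardware-counter readings -- the kind of
# measurement the abstract says suffices to evaluate CaL in practice.

def apc(total_accesses, memory_active_cycles):
    """Accesses Per memory-active Cycle: higher means more concurrency is
    exploited. In the related C-AMAT work, 1/APC is the concurrent
    average memory access time."""
    return total_accesses / memory_active_cycles

def amat(hit_time, miss_rate, miss_penalty):
    """Classic average memory access time; concurrency-blind."""
    return hit_time + miss_rate * miss_penalty

print(f"APC  = {apc(total_accesses=8_000_000, memory_active_cycles=2_000_000):.2f}")
print(f"AMAT = {amat(hit_time=4, miss_rate=0.03, miss_penalty=180):.2f} cycles")
```

Two programs can share the same AMAT yet differ widely in APC, which is the gap a concurrency-aware metric such as CaL is designed to expose.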


ACM Transactions on Modeling and Performance Evaluation of Computing Systems | 2017

Evaluating the Combined Effect of Memory Capacity and Concurrency for Many-Core Chip Design

Yuhang Liu; Xian-He Sun

Modern memory systems are structured around hierarchy and concurrency. The combined impact of hierarchy and concurrency, however, is application dependent and difficult to describe. In this article, we introduce C2-Bound, a data-driven analytical model that serves the purpose of optimizing many-core design. C2-Bound considers both memory capacity and data access concurrency. It combines the newly proposed concurrent average memory access time (C-AMAT) latency model with the well-known memory-bounded speedup model (Sun-Ni's law). Compared to traditional chip designs that lack the notion of memory capacity and concurrency, the C2-Bound model finds that memory-bound factors significantly impact the optimal number of cores as well as their optimal silicon area allocations, especially for data-intensive applications with a non-parallelizable sequential portion. Our model is therefore valuable to the design of next-generation many-core architectures that target big data processing, where working sets are usually larger than in conventional scientific computing. These findings are evidenced by our detailed simulations, which show that, with C2-Bound, the design space of chip design can be narrowed by up to four orders of magnitude. C2-Bound analytic results can be used in reconfigurable hardware environments or applied by software designers to scheduling, partitioning, and allocating resources among diverse applications.
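The memory-bounded speedup half of C2-Bound is the well-known Sun-Ni formula; the sketch below implements only that formula, not the paper's coupling with C-AMAT:

```python
# Sun-Ni's law: memory-bounded speedup on n cores.

def sun_ni_speedup(n, f, g):
    """f: non-parallelizable (sequential) fraction of the work.
    g: factor by which the parallel workload scales when the memory
       capacity of n nodes is used. g = 1 recovers Amdahl's law,
       g = n recovers Gustafson's law."""
    return (f + (1 - f) * g) / (f + (1 - f) * g / n)

# Illustrative: workload grows with the square root of node count.
for n in (4, 16, 64):
    print(n, round(sun_ni_speedup(n, f=0.05, g=n ** 0.5), 2))
```

The sequential fraction f caps the achievable speedup, which is why the abstract singles out data-intensive applications with a non-parallelizable portion as the cases where the memory-bound factors bite hardest.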


International Conference on Hardware/Software Codesign and System Synthesis | 2016

Efficient design space exploration by knowledge transfer

Dandan Li; Senzhang Wang; Shuzhen Yao; Yuhang Liu; Yuanqi Cheng; Xian-He Sun

Due to the exponentially growing design space of microprocessors and time-consuming simulations, predictive models have been widely employed in design space exploration (DSE). Traditional approaches mostly build a program-specific predictor that needs a large number of program-specific samples, so considerable simulation cost is incurred for each program. In this paper, we study the novel problem of transferring knowledge from the labeled samples of previous programs to help predict the responses of a new target program whose labeled samples are very sparse. Inspired by recent advances in transfer learning, we propose a transfer-learning-based DSE framework, TrDSE, that builds a more efficient and effective predictive model for the target program with only a few simulations by borrowing knowledge from previous programs. Specifically, TrDSE includes two phases: 1) clustering the programs based on the proposed orthogonal array sampling and distribution-related features, and 2) guided by the clustering results, predicting the responses of configurations in the design space of the target program with a transfer-learning-based regression algorithm. We evaluate TrDSE on the benchmarks of the SPEC CPU 2006 suite. The results demonstrate that the proposed framework is more efficient and effective than state-of-the-art DSE techniques.
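TrDSE's regression algorithm is not spelled out in the abstract; a generic stand-in for the transfer idea is instance-weighted regression, where abundant source-program samples are pooled with a few target-program samples but trusted less. A minimal sketch under that assumption (the weights, data, and coefficients are all synthetic):

```python
# Instance-weighted ridge regression as a stand-in for transfer-based DSE:
# pool many source-program samples with few target-program samples.
import numpy as np

def weighted_ridge(X, y, w, lam=1e-3):
    """Closed-form ridge regression with per-sample weights."""
    W = np.diag(w)
    A = X.T @ W @ X + lam * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ W @ y)

rng = np.random.default_rng(0)
X_src = rng.random((200, 4))                  # many labeled source configs
y_src = X_src @ [3, -1, 2, 0.5] + 0.3         # source program's response
X_tgt = rng.random((10, 4))                   # only a few target simulations
y_tgt = X_tgt @ [3.2, -0.8, 2.1, 0.4] + 0.5   # related but shifted response

X = np.vstack([X_src, X_tgt])
y = np.concatenate([y_src, y_tgt])
w = np.concatenate([np.full(len(y_src), 0.2),   # trust source samples less
                    np.full(len(y_tgt), 1.0)])
print("transferred coefficients:", np.round(weighted_ridge(X, y, w), 2))
```

The clustering phase of TrDSE would decide which previous programs are similar enough to pool from in the first place; here a single source program is assumed for brevity.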


Design Automation Conference | 2016

Efficient design space exploration via statistical sampling and AdaBoost learning

Dandan Li; Shuzhen Yao; Yuhang Liu; Senzhang Wang; Xian-He Sun

Design space exploration (DSE) has become a notoriously difficult problem due to the exponentially growing design space of microprocessors and time-consuming simulations. To address this issue, machine learning techniques have been widely employed to build predictive models. However, most previous approaches sample the training set randomly, leading to considerable simulation cost and low prediction accuracy. In this paper, we propose an efficient and precise DSE methodology combining statistical sampling with the AdaBoost learning technique. The proposed method includes three phases. First, orthogonal-design-based feature selection is employed to prune the design space. Second, an orthogonal-array-based training data sampling method is introduced to select representative configurations for simulation. Finally, a new active learning approach, ActBoost, is proposed to build the predictive model. Evaluations demonstrate that the proposed framework is more efficient and precise than state-of-the-art DSE techniques.
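ActBoost itself is not publicly available, so the sketch below stands in with scikit-learn's stock AdaBoost regressor and plain random sampling (the very baseline the paper improves on with orthogonal arrays); the design space and "IPC" response are synthetic:

```python
# Train an AdaBoost regressor on a small simulated subset of a design
# space, then predict the rest -- the basic DSE-by-prediction loop.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(1)
space = rng.integers(0, 4, size=(1000, 6))       # 6 design knobs, 4 levels each
truth = space @ [5, 3, 8, 1, 2, 4] + rng.normal(0, 1, 1000)  # stand-in response

train = rng.choice(len(space), size=60, replace=False)  # "simulate" 60 configs
model = AdaBoostRegressor(n_estimators=100, random_state=0)
model.fit(space[train], truth[train])

pred = model.predict(space)
print(f"MAE over the full space: {np.abs(pred - truth).mean():.2f}")
```

Replacing the random `train` selection with an orthogonal-array-based one, and the stock booster with an active learner, is where the paper's contribution lies.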


International Conference on Computer Design | 2017

TDV Cache: Organizing Off-Chip DRAM Cache of NVMM from a Fusion Perspective

Tianyue Lu; Yuhang Liu; Haiyang Pan; Mingyu Chen

Emerging Non-Volatile Memory (NVM) provides both larger memory capacity and higher energy efficiency, but has much longer access latency than traditional DRAM; DRAM can therefore be used as an efficient cache to hide the long latency of a Non-Volatile Main Memory (NVMM) system. The Transparent Off-chip DRAM cache (TOD cache) is a new DRAM cache structure in which an off-chip DRAM module is used as an L4 cache and managed by hardware. The capacity and latency ratios of a TOD cache over NVM are both quite different from those of traditional on-chip SRAM or die-stacked DRAM caches over off-chip DRAM memory, so all the relevant factors, including hit latency, miss latency, and hit rate, need to be reconsidered for TOD cache design. In this study, we first point out that three traditional cache schemes cannot be used directly for a TOD cache: a set-associative cache suffers from extra tag-lookup latency, a direct-mapped cache has a low hit rate, and a tag cache is too small to efficiently hold the working set of tags for a DRAM cache. Based on these observations, we propose a novel cache scheme, TDV, that fuses these three types of cache to combine their advantages. In TDV, a direct-mapped cache is used as the first-level cache to achieve short access latency, a set-associative victim cache serves as the second-level cache to obtain an extra-high hit rate, and an SRAM tag cache serves only the victim cache rather than the whole DRAM cache, which improves the hit rate of the tag cache significantly. Simulation results show that the TDV cache improves performance by 6.3% and 8.3% on average over the state-of-the-art direct-mapped (Alloy cache) and set-associative (ATCache) caches, respectively, with the same DRAM and SRAM capacities.
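A toy model of the TDV lookup path (direct-mapped front cache backed by a set-associative victim cache) is sketched below; timing, the tag cache, and DRAM layout are omitted, and the sizes are arbitrary:

```python
# Direct-mapped cache with a set-associative LRU victim cache behind it.
from collections import OrderedDict

DM_SETS, VC_SETS, VC_WAYS = 256, 64, 8            # arbitrary toy sizes
direct = {}                                        # set index -> tag
victim = [OrderedDict() for _ in range(VC_SETS)]   # per-set LRU tag stores

def access(block):
    dm_set, dm_tag = block % DM_SETS, block // DM_SETS
    if direct.get(dm_set) == dm_tag:
        return "dm-hit"                    # fast path: one direct-mapped probe
    vc_set, vc_tag = block % VC_SETS, block // VC_SETS
    hit = vc_tag in victim[vc_set]
    if hit:
        del victim[vc_set][vc_tag]         # promote back into the front cache
    evicted = direct.get(dm_set)
    direct[dm_set] = dm_tag                # fill/replace in the front cache
    if evicted is not None:                # demote the displaced block
        eblock = evicted * DM_SETS + dm_set
        vset = victim[eblock % VC_SETS]
        vset[eblock // VC_SETS] = True
        vset.move_to_end(eblock // VC_SETS)
        if len(vset) > VC_WAYS:
            vset.popitem(last=False)       # LRU eviction from the victim set
    return "vc-hit" if hit else "miss"

for b in (1, 2, 257, 1, 2):
    print(b, access(b))   # 257 conflicts with 1; 1 then hits in the victim cache
```

The design intent carries over even in this toy: conflict victims of the direct-mapped level get a second chance in associative storage, recovering hit rate without paying a set-associative lookup on every access.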


Journal of Systems Engineering and Electronics | 2014

Elastic pointer directory organization for scalable shared memory multiprocessors

Yuhang Liu; Mingfa Zhu; Limin Xiao

In the field of supercomputing, one key issue for scalable shared-memory multiprocessors is the design of the directory, which denotes the sharing state of a cache block. A good directory design aims to achieve three key attributes: reasonable memory overhead, sharer position precision, and low implementation complexity. However, researchers often face the problem that gaining one attribute may mean losing another. This paper proposes an elastic pointer directory (EPD) structure based on an analysis of shared-memory applications, exploiting the fact that the number of sharers for each directory entry is typically small. Analysis shows that for 4096 nodes, the memory overhead relative to a full-map directory is 2.7%. Theoretical analysis and cycle-accurate execution-driven simulations on 16- and 64-node cache-coherent non-uniform memory access (CC-NUMA) multiprocessors show that the pointer overflow probability is reduced significantly. The performance is observed to be better than that of a limited-pointer directory and almost identical to that of a full-map directory, at the cost of slight implementation complexity. Using a directory cache to exploit directory access locality is also studied. The experimental results show that this is a promising approach for the state-of-the-art high-performance computing domain.
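To make the overflow problem EPD addresses concrete, here is a toy limited-pointer directory entry with a broadcast fallback; EPD's elastic pointer borrowing itself is not modeled, and the pointer count is illustrative:

```python
# A limited-pointer directory entry: precise up to MAX_PTRS sharers, then
# overflow forces imprecise (broadcast) invalidation.

class DirEntry:
    MAX_PTRS = 4  # hardware pointer slots per entry (illustrative)

    def __init__(self):
        self.sharers = set()
        self.overflowed = False   # fell back to broadcast mode

    def add_sharer(self, node):
        if self.overflowed:
            return
        self.sharers.add(node)
        if len(self.sharers) > self.MAX_PTRS:
            self.overflowed = True   # pointers exhausted: precision is lost
            self.sharers.clear()

    def invalidate_targets(self, num_nodes):
        """Nodes that must receive an invalidation on a write."""
        return range(num_nodes) if self.overflowed else self.sharers

e = DirEntry()
for n in (3, 7, 9):
    e.add_sharer(n)
print(sorted(e.invalidate_targets(64)))       # precise: [3, 7, 9]
for n in (12, 15):
    e.add_sharer(n)
print(len(list(e.invalidate_targets(64))))    # overflowed: all 64 nodes
```

Because most blocks really do have few sharers, a scheme that keeps overflow rare, as EPD does by elastically reallocating pointers, stays close to full-map precision at a fraction of the storage.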


Archive | 2010

High-density rack server radiating system

Xianchu Cheng; Guoping Du; Yuhang Liu; Limin Xiao; Mingfa Zhu

Collaboration


Dive into Yuhang Liu's collaboration.

Top Co-Authors

Xian-He Sun

Illinois Institute of Technology

Mingyu Chen

Chinese Academy of Sciences

Senzhang Wang

Nanjing University of Aeronautics and Astronautics

Haiyang Pan

Chinese Academy of Sciences

Jue Wang

Chinese Academy of Sciences
