Yongbing Huang | Researchain

Publication

Featured research published by Yongbing Huang.

international symposium on performance analysis of systems and software | 2014

Moby: A mobile benchmark suite for architectural simulators

Yongbing Huang; Zhongbin Zha; Mingyu Chen; Lixin Zhang

Mobile devices such as smartphones and tablets have become the primary consumer computing devices, and their rate of adoption continues to grow. The applications that run on these mobile platforms vary in how they use hardware resources, and their diversity is increasing. Performance and power limitations also vary widely across mobile platforms. Thus, there is a growing need for tools to help computer architects design systems to meet the needs of mobile workloads. Full-system simulators are invaluable tools for designing new architectures, but we still need appropriate benchmark suites that capture the behaviors of emerging mobile applications. Current benchmark suites cover only a small range of mobile applications, and many cannot run directly in simulators due to their user interaction requirements. In this paper, we introduce and characterize Moby, a benchmark suite designed to make it easier to use full-system architectural simulators to evaluate microarchitectures for mobile processors. Moby contains popular Android applications, including a web browser, a social networking application, an email client, a music player, a video player, a document processing application, and a map program. To facilitate microarchitectural exploration, we port the Moby benchmark suite to the popular gem5 simulator. We characterize the architecture-independent features of Moby applications on the simulator and analyze the architecture-dependent features on a current-generation mobile platform. Our results show that mobile applications exhibit complex instruction execution behaviors and poor code locality, but current mobile platforms, especially their instruction-related components, cannot meet these requirements.

international conference on parallel architectures and compilation techniques | 2012

HaLock: hardware-assisted lock contention detection in multithreaded applications

Yongbing Huang; Zehan Cui; Licheng Chen; Wenli Zhang; Yungang Bao; Mingyu Chen

Multithreaded programming relies on locks to ensure the consistency of shared data. Lock contention is the main cause of low parallel efficiency and poor scalability in multithreaded programs. Lock profiling is the primary approach to detecting lock contention. Prior lock profiling tools are able to track lock behaviors but store profiling data directly in local memory, regardless of the memory interference this causes for the targeted programs.

Journal of Computer Science and Technology | 2014

MIMS: Towards a Message Interface Based Memory System

Licheng Chen; Mingyu Chen; Yuan Ruan; Yongbing Huang; Zehan Cui; Tianyue Lu; Yungang Bao

The decades-old synchronous memory bus interface has restricted many innovations in the memory system, which is facing various challenges (or walls) in the era of multi-core and big data. In this paper, we argue that a message-based interface should be adopted to replace the traditional bus-based interface in the memory system. A novel message-interface-based memory system called MIMS is proposed. The key innovation of MIMS is that processors communicate with the memory system through a universal and flexible message packet interface. Each message packet is allowed to encapsulate multiple memory requests (or commands) and additional semantic information. The memory system becomes more intelligent and active by being equipped with a local buffer scheduler, which is responsible for processing packets, scheduling memory requests, preparing responses, and executing specific commands with the help of semantic information. Under the MIMS framework, many previous innovations in memory architecture, as well as new optimization opportunities such as address compression and the combination of continuous requests, can be naturally incorporated. The experimental results on a 16-core cycle-detailed simulation system show that, with accurate-granularity messages, MIMS can improve system performance by 53.21% and reduce energy delay product (EDP) by 55.90%. Furthermore, it can improve effective bandwidth utilization by 62.42% and reduce memory access latency by 51% on average.
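
As a rough illustration of the packet idea described in this abstract, the following Python sketch bundles several memory requests into one packet, merges requests that touch adjacent bytes, and stores each address as an offset from a shared base (a simple stand-in for address compression). All names here (`MemRequest`, `MessagePacket`, `combine_contiguous`, `pack`, `unpack`) and the 64-byte default request size are assumptions made for illustration; the abstract does not specify the actual MIMS packet format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class MemRequest:
    """One memory request (hypothetical layout, not the MIMS wire format)."""
    addr: int          # physical byte address
    is_write: bool
    size: int = 64     # bytes; assumed default of one cache line


@dataclass
class MessagePacket:
    """A packet carrying several requests relative to one base address."""
    base_addr: int
    # Each entry is (offset from base_addr, is_write, size).
    entries: List[Tuple[int, bool, int]] = field(default_factory=list)


def combine_contiguous(requests: List[MemRequest]) -> List[MemRequest]:
    """Merge same-type requests whose byte ranges are back to back.

    Note: this sketch sorts by address and ignores ordering constraints
    between reads and writes, which a real scheduler must respect.
    """
    merged: List[MemRequest] = []
    for req in sorted(requests, key=lambda r: r.addr):
        last = merged[-1] if merged else None
        if (last is not None and last.is_write == req.is_write
                and last.addr + last.size == req.addr):
            merged[-1] = MemRequest(last.addr, last.is_write, last.size + req.size)
        else:
            merged.append(req)
    return merged


def pack(requests: List[MemRequest]) -> MessagePacket:
    """Encode requests as offsets from the lowest address in the batch."""
    if not requests:
        raise ValueError("cannot pack an empty request list")
    base = min(r.addr for r in requests)
    return MessagePacket(base, [(r.addr - base, r.is_write, r.size) for r in requests])


def unpack(packet: MessagePacket) -> List[MemRequest]:
    """Recover the full-address requests on the memory side."""
    return [MemRequest(packet.base_addr + off, w, size)
            for off, w, size in packet.entries]
```

Storing small offsets instead of full addresses is only one possible reading of "address compression"; the sketch is meant to show why a packet interface makes such optimizations easy to express, not how MIMS implements them.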

international conference on cluster computing | 2012

Evaluation and Optimization of Breadth-First Search on NUMA Cluster

Zehan Cui; Licheng Chen; Mingyu Chen; Yungang Bao; Yongbing Huang; Huiwei Lv

Graphs are widely used in many areas. Breadth-First Search (BFS), a key subroutine for many graph analysis algorithms, has become the primary benchmark for Graph500 ranking. Due to the high communication cost of BFS, multi-socket nodes with large memory capacity (NUMA) are supposed to reduce network pressure. However, the longer latency to remote memory may cause problems if not handled well. In this work, we first demonstrate that simply spawning and binding one MPI process for each socket can achieve the best performance for an MPI/OpenMP hybrid programmed BFS algorithm, resulting in 1.53X of performance on 16 nodes. Nevertheless, we notice that one MPI process per socket may exacerbate the communication cost. We propose to share some communication data structures among the processes inside the same node, to eliminate most of the intra-node communication. To fully utilize the network bandwidth, we make all the processes in a node perform communication simultaneously. We further adjust the granularity of a key bitmap for better cache locality to speed up the computation. With all the optimizations for NUMA, communication and computation together, 2.44X of performance is achieved on 16 nodes, which is 39.2 Billion Traversed Edges per Second for an R-MAT graph of scale 32 (4 billion vertices and 64 billion edges).
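
For readers unfamiliar with the terms in this abstract, the sketch below shows a single-process, level-synchronous BFS that tracks visited vertices in a bitmap, plus the problem-size arithmetic behind "scale 32". It is not the paper's MPI/OpenMP implementation; the function names are hypothetical, and the edge factor of 16 is the Graph500 default, assumed here because it matches the vertex and edge counts quoted above.

```python
def graph500_size(scale, edgefactor=16):
    """Return (vertices, edges) for a Graph500-style problem scale."""
    vertices = 2 ** scale
    return vertices, edgefactor * vertices


def bfs_levels(adj, root):
    """Level-synchronous BFS over an adjacency list.

    Visited state is kept as one bit per vertex in a bytearray, which is
    the kind of compact bitmap whose granularity affects cache locality.
    Returns the BFS level of every vertex (-1 if unreachable).
    """
    n = len(adj)
    visited = bytearray((n + 7) // 8)

    def test_and_set(v):
        # Return True if v was already visited, and mark it visited.
        byte, mask = v >> 3, 1 << (v & 7)
        seen = visited[byte] & mask
        visited[byte] |= mask
        return bool(seen)

    level = [-1] * n
    test_and_set(root)
    level[root] = 0
    frontier = [root]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if not test_and_set(v):
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

With `scale=32`, `graph500_size` gives 2^32 vertices and 16 × 2^32 edges, which is where the "4 billion vertices and 64 billion edges" figure comes from when counted in binary units.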

ACM Transactions on Architecture and Code Optimization | 2014

HMTT: A hybrid hardware/software tracing system for bridging the DRAM access trace's semantic gap

Yongbing Huang; Licheng Chen; Zehan Cui; Yuan Ruan; Yungang Bao; Mingyu Chen; Ninghui Sun

DRAM access traces (i.e., off-chip memory references) can be extremely valuable for the design of memory subsystems and performance tuning of software. Hardware snooping on the off-chip memory interface is an effective and nonintrusive approach to monitoring and collecting real-life DRAM accesses. However, compared with software-based approaches, hardware snooping approaches typically lack semantic information, such as process/function/object identifiers, virtual addresses, and lock contexts, that is essential to the complete understanding of the systems and software under investigation. In this article, we propose a hybrid hardware/software mechanism that is able to collect off-chip memory reference traces with semantic information. We have designed and implemented a prototype system called HMTT (Hybrid Memory Trace Tool), which uses a custom-made DIMM connector to collect off-chip memory references and a high-level event-encoding scheme to correlate semantic information with memory references. In addition to providing complete, undistorted DRAM access traces, the proposed system is also able to perform various types of low-overhead profiling, such as object-relative accesses and multithread lock accesses.

international conference on computer design | 2013

Scattered superpage: A case for bridging the gap between superpage and page coloring

Licheng Chen; Yanan Wang; Zehan Cui; Yongbing Huang; Yungang Bao; Mingyu Chen

Superpage and page coloring are two important practical techniques to improve the performance of Translation Lookaside Buffers (TLBs) and the shared Last Level Cache (LLC), respectively. However, there exists a gap between these two techniques in current hardware-architecture design, resulting in a contradiction when adopting these two optimizations simultaneously: a superpage requires hundreds of contiguous (e.g., a power of two) base pages in both virtual and physical memory, which necessarily occupy all available page colors (or cache sets), thus making page coloring fail to work. This is because most contemporary architectures adopt a design with cache set indexes placed in the least significant part of the block address. In this paper, we propose a lightweight approach named Scattered Superpage to bridge this gap. Scattered Superpage decouples a superpage from the limitation of occupying multiple contiguous physical base pages. A superpage is still contiguous in virtual memory, but it is mapped in a scattered fashion onto multiple physical superpages and occupies only a specified subset of page colors in each physical superpage, which allows us to configure the page colors for each superpage. The huge TLB is slightly modified to store the page color configuration for each superpage and to calculate the target physical address based on this configuration when doing address translation. The experimental results show that the Scattered Superpage can improve system performance by 20.51% and reduce unfairness by 27.77% in our 4-core simulation system (with multi-program memory-intensive workloads). It achieves this by reducing last level cache misses by 17.05% and reducing TLB misses by 86.02% simultaneously.
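
The conflict described in this abstract can be checked with a few lines of arithmetic. The sketch below computes conventional page colors and shows that one physically contiguous superpage spans every color. It illustrates the problem only, not the Scattered Superpage mechanism, and the cache and page parameters (4 KiB base pages, 2 MiB superpages, an 8 MiB 16-way LLC) are assumptions chosen for the example, not values from the paper.

```python
# Assumed parameters for illustration only.
PAGE_SIZE = 4 * 1024                 # base page size in bytes
SUPERPAGE_SIZE = 2 * 1024 * 1024     # superpage size in bytes
LLC_SIZE = 8 * 1024 * 1024           # last-level cache capacity in bytes
LLC_WAYS = 16                        # LLC associativity


def num_colors():
    """Number of page colors: bytes per cache way divided by the page size."""
    return LLC_SIZE // (LLC_WAYS * PAGE_SIZE)


def page_color(paddr):
    """Color of the base page containing physical address paddr.

    The set index sits in the low bits of the block address, so the color
    is given by the low bits of the physical page frame number.
    """
    return (paddr // PAGE_SIZE) % num_colors()


def colors_in_superpage(base_paddr):
    """Set of colors touched by one physically contiguous superpage."""
    pages = SUPERPAGE_SIZE // PAGE_SIZE
    return {page_color(base_paddr + i * PAGE_SIZE) for i in range(pages)}
```

Under these assumptions there are 128 colors and a superpage holds 512 consecutive base pages, so `colors_in_superpage` always returns all 128 colors, leaving the operating system no way to restrict a superpage to a subset of the cache.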

Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness | 2012

Trace-driven simulation of memory system scheduling in multithread application

Pengfei Zhu; Mingyu Chen; Yungang Bao; Licheng Chen; Yongbing Huang

As commercial chip-multiprocessors (CMPs) integrate more and more cores, memory systems are playing an increasingly important role in multithread applications. Currently, trace-driven simulation is widely adopted in memory system scheduling research, since it is faster than execution-driven simulation and does not require data computation. However, for the same reason, its trace replay for concurrent thread execution lacks data information and contains only addresses, so misplacement occurs in simulations when the trace of one thread runs ahead of or behind the others. This kind of distortion can cause significant errors in research results. As shown in our experiment, trace misplacement causes an error rate of up to 10.22% in the metrics, including weighted IPC speedup, harmonic mean of IPC, and CPI throughput. This paper presents a methodology to avoid trace misplacement in trace-driven simulation and to ensure the accuracy of memory scheduling simulation in multithread applications, thus providing a reliable means to study inter-thread actions in memory systems.
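
Two of the metrics named in this abstract have widely used definitions, sketched below together with a relative-error helper of the kind one might use to compare a distorted simulation against a reference. These are the common textbook forms and are assumed here; the paper's exact formulas may differ, and "CPI throughput" is left out because its definition is not given in the abstract. The function names are hypothetical.

```python
def weighted_speedup(ipc_shared, ipc_alone):
    """Sum over threads of IPC when sharing the system divided by IPC when run alone."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC per shared-IPC")
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))


def harmonic_mean_speedup(ipc_shared, ipc_alone):
    """Harmonic mean of per-thread relative IPC, which also penalizes unfairness."""
    if len(ipc_shared) != len(ipc_alone):
        raise ValueError("need one alone-IPC per shared-IPC")
    return len(ipc_shared) / sum(a / s for s, a in zip(ipc_shared, ipc_alone))


def relative_error(measured, reference):
    """Relative deviation of a measured metric from its reference value."""
    return abs(measured - reference) / abs(reference)
```

For example, two threads that each run at half their standalone IPC give a weighted speedup of 1.0 and a harmonic mean of 0.5.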

international symposium on performance analysis of systems and software | 2012