Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yungang Bao is active.

Publication


Featured research published by Yungang Bao.


International Conference on Parallel Architectures and Compilation Techniques | 2012

A software memory partition approach for eliminating bank-level interference in multicore systems

Lei Liu; Zehan Cui; Mingjie Xing; Yungang Bao; Mingyu Chen; Chengyong Wu

The main memory system is a shared resource in modern multicore machines, resulting in serious interference that degrades performance in terms of both throughput and fairness. Numerous new memory scheduling algorithms have been proposed to address this interference problem. However, these algorithms usually employ complex scheduling logic and require hardware modifications to memory controllers; as a result, industry vendors seem hesitant to adopt them.
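The abstract above motivates a software-only alternative to hardware scheduling. A minimal sketch of the page-coloring idea behind such approaches, assuming a simplified physical-address-to-bank mapping (the bit positions and bank count here are illustrative, not those used in the paper):

```python
# Software bank partitioning via page coloring (illustrative sketch).
# Assumed mapping: bank index = physical address bits 13-15.
PAGE_SHIFT = 12        # 4 KiB pages
BANK_SHIFT = 13        # assumed position of the bank-index field
NUM_BANKS = 8          # assumed 8 banks (3 bank bits)

def bank_of(phys_addr):
    """Bank index taken from the assumed address bits."""
    return (phys_addr >> BANK_SHIFT) & (NUM_BANKS - 1)

def pages_for_core(core_id, num_cores, all_pages):
    """Give each core only the page frames that map to its own banks,
    so cores never contend for the same bank (the partition idea)."""
    my_banks = {b for b in range(NUM_BANKS) if b % num_cores == core_id}
    return [p for p in all_pages if bank_of(p << PAGE_SHIFT) in my_banks]

pages = list(range(64))                  # 64 physical page frames
core0 = pages_for_core(0, 2, pages)
core1 = pages_for_core(1, 2, pages)
assert not set(core0) & set(core1)       # disjoint banks, no interference
```

Because the OS already controls which physical frames back each process, this kind of partition needs no memory-controller changes, which is precisely the appeal noted above.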


IEEE International Conference on High Performance Computing, Data, and Analytics | 2011

Fast implementation of DGEMM on Fermi GPU

Guangming Tan; Linchuan Li; Sean Triechle; Everett H. Phillips; Yungang Bao; Ninghui Sun

In this paper we present a thorough study of tuning double-precision matrix-matrix multiplication (DGEMM) on the Fermi GPU architecture. We choose an optimal algorithm with blocking in both shared memory and registers to satisfy the constraints of the Fermi memory hierarchy. Our optimization strategy is further guided by performance modeling based on microarchitecture benchmarks. Our optimizations include software pipelining, use of vector memory operations, and instruction scheduling. Our best CUDA algorithm achieves performance comparable to the latest CUBLAS library. We further improve upon this with an implementation in the native machine language, leading to a 20% increase in performance. That is, the achieved peak performance (efficiency) is improved from 302 Gflop/s (58%) to 362 Gflop/s (70%).
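The blocking the abstract describes can be sketched in plain Python (not CUDA): each tile of A and B is reused across many multiply-adds, mirroring how a GPU kernel stages tiles in shared memory and holds operands in registers. The tile size here is an arbitrary illustration:

```python
# Blocked matrix multiply C = A * B, illustrating the tiling/reuse idea
# behind a tuned DGEMM. Pure Python sketch, not the paper's kernel.
def blocked_dgemm(A, B, n, tile=4):
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for k0 in range(0, n, tile):          # one tile "load"
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a = A[i][k]               # analogous to a register operand
                        for j in range(j0, min(j0 + tile, n)):
                            C[i][j] += a * B[k][j]
    return C
```

On a GPU the same structure pays off because each loaded tile is read `tile` times from fast storage instead of global memory; the result is identical to the naive triple loop.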


International Symposium on Computer Architecture | 2014

Going vertical in memory management: handling multiplicity by multi-policy

Lei Liu; Yong Li; Zehan Cui; Yungang Bao; Mingyu Chen; Chueh-Hung Wu

Many emerging applications from various domains often exhibit heterogeneous memory characteristics. When running in combination on parallel platforms, these applications present a daunting variety of workload behaviors that challenge the effectiveness of any memory allocation strategy. Prior partitioning-based or random memory allocation schemes typically manage only one level of the memory hierarchy and often target specific workloads. To handle diverse and dynamically changing memory and cache allocation needs, we augment existing “horizontal” cache/DRAM bank partitioning with vertical partitioning and explore the resulting multi-policy space. We study the performance of these policies for over 2000 workloads and correlate the results with application characteristics via a data mining approach. Based on this correlation we derive several practical memory allocation rules that we integrate into a unified multi-policy framework to guide resource partitioning and coalescing for dynamic and diverse multi-programmed/threaded workloads. We implement our approach in Linux kernel 2.6.32 as a restructured page indexing system plus a series of kernel modules. Extensive experiments show that, in practice, our framework can select the proper memory allocation policy and consistently outperforms the unmodified Linux kernel, achieving up to 11% performance gains compared to prior techniques.
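The "vertical" idea can be pictured as giving each physical page a two-part color. A minimal sketch under assumed bit positions (the color widths and their placement in the frame number are hypothetical, chosen only to make the contrast between one-level and two-level partitioning concrete):

```python
# A page's color is (cache color, bank color); a "horizontal" policy fixes
# one coordinate, a "vertical" policy fixes both. Bit layout is assumed.
CACHE_COLOR_BITS = 2   # assumed: 4 cache colors
BANK_COLOR_BITS = 2    # assumed: 4 bank colors

def page_color(pfn):
    cache_color = pfn & ((1 << CACHE_COLOR_BITS) - 1)
    bank_color = (pfn >> CACHE_COLOR_BITS) & ((1 << BANK_COLOR_BITS) - 1)
    return cache_color, bank_color

def pages_with_color(pfns, cache_color=None, bank_color=None):
    """Select frames matching the requested color(s); None = don't care."""
    out = []
    for p in pfns:
        c, b = page_color(p)
        if (cache_color is None or c == cache_color) and \
           (bank_color is None or b == bank_color):
            out.append(p)
    return out
```

Fixing both coordinates carves out a subset of the frames that a cache-only policy would grant, which is why combining the two levels enlarges the policy space the paper explores.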


Measurement and Modeling of Computer Systems | 2009

Extending Amdahl's law in the multicore era

Erlin Yao; Yungang Bao; Guangming Tan; Mingyu Chen

The scalability problem is first among the dozen long-term information-technology research goals identified by Jim Gray [2]. Chip multiprocessors (CMPs), or multicores, are emerging as the dominant computing platform. In the multicore era, the scalability problem remains an interesting long-term goal, and it will become more urgent in the next decade. Hill and Marty [4] augment Amdahl's law for multicore hardware by constructing a cost model for the number and performance of cores that a chip can support. They conclude that obtaining optimal multicore performance will require further research in both extracting more parallelism and making sequential cores faster. Woo and Lee [6] develop Hill's work by taking power and energy into account. The revised models provide computer architects with a better understanding of multicore scalability, enabling them to make more informed tradeoffs. However, as far as we know, no work has investigated the theoretical analysis of these models; existing studies are all carried out using programs and experiments. This paper investigates the theoretical analysis of multicore scalability. For asymmetric multicore chips, although the architecture of one large core and many base cores was originally assumed for simplicity, we prove it to be the optimal architecture in the sense of speedup. Upper bounds on the speedups achievable with symmetric, asymmetric, and dynamic multicore architectures are obtained. Given the parallel fraction, performance index, and the number of base core resources, precise quantitative conditions are given to determine how to obtain optimal multicore performance. Our quantitative analysis not only explains Hill's work [4] theoretically, but also extends their results to a more general framework. The analytical tools in this paper can also be applied to the theoretical analysis of Woo and Lee's work [6].
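Hill and Marty's speedup models referenced above are easy to state concretely. With a chip budget of n base-core equivalents (BCEs) and a core built from r BCEs assumed to deliver perf(r) = sqrt(r) (their illustrative cost model):

```python
# Hill-Marty multicore extensions of Amdahl's law; f is the parallel fraction.
import math

def perf(r):
    """Assumed single-core performance of a core built from r BCEs."""
    return math.sqrt(r)

def symmetric_speedup(f, n, r):
    """n/r identical cores of r BCEs each."""
    return 1.0 / ((1 - f) / perf(r) + f * r / (perf(r) * n))

def asymmetric_speedup(f, n, r):
    """One big core of r BCEs plus n - r base cores."""
    return 1.0 / ((1 - f) / perf(r) + f / (perf(r) + n - r))
```

For example, with f = 0.975 and n = 256, an asymmetric chip with one 64-BCE core far outperforms a symmetric chip of 64-BCE cores, because the big core accelerates the serial fraction while the base cores still serve the parallel fraction; this is the tradeoff the paper analyzes theoretically.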


High-Performance Computer Architecture | 2010

DMA cache: Using on-chip storage to architecturally separate I/O data from CPU data for improving I/O performance

Dan Tang; Yungang Bao; Weiwu Hu; Mingyu Chen

As technology advances in both increasing bandwidth and reducing latency for I/O buses and devices, moving I/O data into and out of memory has become critical. In this paper, we observe the different characteristics of I/O and CPU memory reference behavior, and identify the potential benefits of separating I/O data from CPU data. We propose a DMA cache technique to store I/O data in dedicated on-chip storage and present two DMA cache designs. The first design, Decoupled DMA Cache (DDC), adopts additional on-chip storage as the DMA cache to buffer I/O data. The second design, Partition-Based DMA Cache (PBDC), does not require additional on-chip storage, but can dynamically use some ways of the processor's last-level cache (LLC) as the DMA cache. We have implemented and evaluated the two DMA cache designs using an FPGA-based emulation platform and the memory reference traces of real-world applications. Experimental results show that, compared with the existing snooping-cache scheme, DDC can reduce memory access latency (in bus cycles) by 34.8% on average (up to 58.4%), while PBDC can achieve about 80% of DDC's performance improvement despite requiring no additional on-chip storage.


Virtual Execution Environments | 2014

CMD: classification-based memory deduplication through page access characteristics

Licheng Chen; Zhipeng Wei; Zehan Cui; Mingyu Chen; Haiyang Pan; Yungang Bao

Limited main memory size is considered one of the major bottlenecks in virtualization environments. Content-Based Page Sharing (CBPS) is an efficient memory deduplication technique for reducing server memory requirements, in which pages with the same content are detected and merged into a single copy. As the most widely used implementation of CBPS, Kernel Samepage Merging (KSM) organizes all memory pages into two global comparison trees (a stable tree and an unstable tree). To detect page sharing opportunities, each tracked page must be compared with pages already in these two large global trees. However, since the vast majority of compared pages have different content, this induces massive futile comparisons and thus heavy overhead. In this paper, we propose a lightweight page Classification-based Memory Deduplication approach named CMD to reduce the futile page comparison overhead while still detecting page sharing opportunities efficiently. The main innovation of CMD is that pages are grouped into different classifications based on page access characteristics. Pages with similar access characteristics are likely to have the same content, so they are grouped into the same classification. In CMD, the large global comparison trees are divided into multiple small trees, with dedicated local trees in each page classification. Page comparisons are performed only within the same classification, and pages from different classifications are never compared (since such comparisons would probably be futile). The experimental results show that CMD can efficiently reduce page comparisons (by about 68.5%) while detecting nearly the same number of page sharing opportunities (more than 98%) or even more.
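The core trade-off can be demonstrated with a toy model. The sketch below compares a single global pool (KSM-like, modeled here with a flat list rather than KSM's trees) against per-classification pools; the access statistic and the assumption that same-content pages always fall in the same class are both hypothetical simplifications:

```python
# Toy model of classification-based deduplication: compare candidate pages
# only against previously seen pages in the same classification.
from collections import defaultdict

def dedup_comparisons(pages, classify):
    """pages: list of (content, access_stat).
    Returns (#content comparisons performed, #shared pages found)."""
    pools = defaultdict(list)          # classification -> unique contents seen
    comparisons = shared = 0
    for content, stat in pages:
        pool = pools[classify(stat)]
        for other in pool:             # compare only inside this classification
            comparisons += 1
            if other == content:
                shared += 1
                break
        else:
            pool.append(content)       # no match: remember this page
    return comparisons, shared
```

Running the same page stream with `classify=lambda s: 0` (one global pool) versus a real classifier shows fewer comparisons for the same number of detected shares, provided the classification never separates identical pages.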


International Conference on Parallel Architectures and Compilation Techniques | 2012

HaLock: hardware-assisted lock contention detection in multithreaded applications

Yongbing Huang; Zehan Cui; Licheng Chen; Wenli Zhang; Yungang Bao; Mingyu Chen

Multithreaded programming relies on locks to ensure the consistency of shared data. Lock contention is a main cause of low parallel efficiency and poor scalability in multithreaded programs, and lock profiling is the primary approach to detecting it. Prior lock profiling tools are able to track lock behaviors, but they store profiling data directly into local memory, regardless of the memory interference this imposes on the profiled programs.


International Green Computing Conference and Workshops | 2011

A fine-grained component-level power measurement method

Zehan Cui; Yan Zhu; Yungang Bao; Mingyu Chen

The ever-growing energy consumption of computer systems has become an increasingly serious problem over the past few years. Power profiling is a fundamental way to better understand where, when, and how energy is consumed. This paper presents a direct measurement method for measuring the power of the main computer components at fine time granularity, employing only a small amount of extra hardware. An approach to synchronizing power measurements with program phases is also proposed. Using a preliminary version of our tools, we measure the power of the CPU, memory, and disk when running the SPEC CPU2006 benchmarks, and show that measurement at fine time granularity is essential. The phenomena we observe in memory power may serve as a guide for memory management or architecture design toward energy efficiency.


IEEE Transactions on Computers | 2015

Statistical Performance Comparisons of Computers

Tianshi Chen; Qi Guo; Olivier Temam; Yue Wu; Yungang Bao; Zhiwei Xu; Yunji Chen

As a fundamental task in computer architecture research, performance comparison has been continuously hampered by the variability of computer performance. In traditional performance comparisons, the impact of performance variability is usually ignored (i.e., the means of performance measurements are compared regardless of the variability). In the few cases where it is factored in using parametric confidence techniques, the confidence is either erroneously computed based on the distribution of the performance measurements themselves (with the implicit assumption that they obey the normal law), instead of the distribution of the sample mean of performance measurements, or too few measurements are considered for the distribution of the sample mean to be normal. We first illustrate how such erroneous practices can lead to incorrect comparisons. Then, we propose a non-parametric Hierarchical Performance Testing (HPT) framework for performance comparison, which is significantly more practical than standard parametric techniques because it does not require collecting a large number of measurements in order to achieve a normal distribution of the sample mean. This HPT framework has been implemented as open-source software.
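The HPT framework itself is more involved, but the non-parametric spirit can be sketched with a simple permutation test, which makes no normality assumption about the measurements (this is a generic illustration, not the paper's algorithm):

```python
# One-sided permutation test: is mean(xs) > mean(ys) by more than chance
# would explain, without assuming the measurements are normally distributed?
import random

def permutation_pvalue(xs, ys, trials=2000, seed=0):
    rng = random.Random(seed)
    observed = sum(xs) / len(xs) - sum(ys) / len(ys)
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)                     # break any real group structure
        diff = sum(pooled[:len(xs)]) / len(xs) - sum(pooled[len(xs):]) / len(ys)
        if diff >= observed:
            hits += 1
    return hits / trials                        # fraction as extreme as observed
```

Clearly separated samples yield a small p-value while identical samples do not, which is the kind of variability-aware conclusion the abstract argues naive mean comparisons cannot support.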


International Conference on Supercomputing | 2014

DTail: a flexible approach to DRAM refresh management

Zehan Cui; Sally A. McKee; Zhongbin Zha; Yungang Bao; Mingyu Chen

DRAM cells must be refreshed (or rewritten) periodically to maintain data integrity, and as DRAM density grows, so do the refresh time and energy. Not all data need to be refreshed with the same frequency, though, and thus some refresh operations can safely be delayed. Tracking such information allows the memory controller to reduce refresh costs by judiciously choosing when to refresh different rows. Solutions that store imprecise information miss opportunities to avoid unnecessary refresh operations, but the storage for tracking complete information scales with memory capacity. We therefore propose a flexible approach to refresh management that tracks complete refresh information within the DRAM itself, where it incurs negligible storage cost (0.006% of total capacity) and can be managed easily in hardware or software. Completely tracking multiple types of refresh information (e.g., row retention time and data validity) maximizes refresh reduction and lets us choose the most effective refresh schemes. Our evaluations show that our approach saves 25-82% of the total DRAM energy over prior refresh-reduction mechanisms.
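A back-of-the-envelope model shows why per-row retention tracking pays off. The retention times and the 64 ms base period below are illustrative inputs, not measured values from the paper:

```python
# Count refresh operations over a time window when each row is refreshed
# only as often as its own retention time requires (rounded down to a
# multiple of the base refresh period), versus refreshing every row at
# the worst-case base period.
def refreshes_needed(retention_ms, window_ms, base_period_ms=64):
    total = 0
    for r in retention_ms:
        period = max(base_period_ms, (r // base_period_ms) * base_period_ms)
        total += window_ms // period
    return total

rows = [64, 256, 256, 1024]              # assumed per-row retention times (ms)
naive = refreshes_needed([64] * len(rows), 1024)   # worst case for every row
tracked = refreshes_needed(rows, 1024)             # complete per-row tracking
```

Since most rows retain data far longer than the worst case, complete tracking eliminates the majority of refreshes here (25 versus 64 operations in this toy window), which is the effect the paper exploits at scale.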

Collaboration


Dive into Yungang Bao's collaborations.

Top Co-Authors

Mingyu Chen, Chinese Academy of Sciences
Zehan Cui, Chinese Academy of Sciences
Licheng Chen, Chinese Academy of Sciences
Yongbing Huang, Chinese Academy of Sciences
Xiufeng Sui, Chinese Academy of Sciences
Guangming Tan, Chinese Academy of Sciences
Jianxun Liu, Chinese Academy of Sciences
Mengying Guo, Chinese Academy of Sciences
Ninghui Sun, Chinese Academy of Sciences