Yaoqing Gao | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Yaoqing Gao is active.

Explore More

Publication

Featured researches published by Yaoqing Gao.

international conference on parallel architectures and compilation techniques | 2011

Linear-time Modeling of Program Working Set in Shared Cache

Xiaoya Xiang; Bin Bao; Chen Ding; Yaoqing Gao

Many techniques characterize the program working set by the notion of the program footprint, which is the volume of data accessed in a time window. A complete characterization requires measuring data access in all O(n^2) windows in an n-element trace. Two recent techniques have significantly reduced the measurement time, but the cost is still too high for real-size workloads. Instead of measuring all footprint sizes, this paper presents a technique for measuring the average footprint size. By confining the analysis to the average rather than the full range, the problem can be solved accurately by a linear-time algorithm. The paper presents the algorithm and evaluates it using the complete suites of 26 SPEC2000 and 29 SPEC2006 benchmarks. The new algorithm is compared against the previously fastest algorithm in both the speed of the measurement and the accuracy of shared-cache performance prediction.

symposium on code generation and optimization | 2010

Exploiting statistical correlations for proactive prediction of program behaviors

Yunlian Jiang; Eddy Z. Zhang; Kai Tian; Feng Mao; Malcom Gethers; Xipeng Shen; Yaoqing Gao

This paper presents a finding and a technique on program behavior prediction. The finding is that surprisingly strong statistical correlations exist among the behaviors of different program components (e.g., loops) and among different types of program level behaviors (e.g., loop trip-counts versus data values). Furthermore, the correlations can be beneficially exploited: They help resolve the proactivity-adaptivity dilemma faced by existing program behavior predictions, making it possible to gain the strengths of both approaches--the large scope and earliness of offline-profiling--based predictions, and the cross-input adaptivity of runtime sampling-based predictions. The main technique contributed by this paper centers on a new concept, seminal behaviors. Enlightened by the existence of strong correlations among program behaviors, we propose a regression based framework to automatically identify a small set of behaviors that can lead to accurate prediction of other behaviors in a program. We call these seminal behaviors. By applying statistical learning techniques, the framework constructs predictive models that map from seminal behaviors to other behaviors, enabling proactive and cross-input adaptive prediction of program behaviors. The prediction helps a commercial compiler, the IBM XL C compiler, generate code that runs up to 45% faster (5%-13% on average), demonstrating the large potential of correlation-based techniques for program optimizations.

international symposium on memory management | 2008

MPADS: memory-pooling-assisted data splitting

Stephen Curial; Peng Zhao; José Nelson Amaral; Yaoqing Gao; Shimin Cui; Raul Esteban Silvera; Roch Georges Archambault

This paper describes Memory-Pooling-Assisted Data Splitting (MPADS), a framework that combines data structure splitting with memory pooling --- Although it MPADS may call to mind memory padding, a distintion of this framework is that is does not insert padding. MPADS relies on pointer analysis to ensure that splitting is safe and applicable to type-unsafe language. MPADS makes no assumption about type safety. The analysis can identify cases in which the transformation could lead to incorrect code and thus MPADS abandons those cases. To make data structure splitting efficient in a commercial compiler, MPADS is designed with great attention to reduce the number of instructions required to access the data after the data-structure splitting. Moreover the implementation of MPADS reveals that architecture details should be considered carefully when re-arranging data allocation. For instance one of the most significant gains from the introduction of data-structure splitting in code targetting the IBM POWER architecture is a dramatic decrease in the amount of data prefetched by the hardware prefetch engine without a noticeable decrease in the cache utilization. Triggering fewer hardware prefetch streams frees memory bandwidth and cache space. Fewer prefetching streams also reduce the interference between the data accessed by multiple cores in modern multicore processors.

international conference on supercomputing | 2005

Lightweight reference affinity analysis

Xipeng Shen; Yaoqing Gao; Chen Ding; Roch Georges Archambault

Previous studies have shown that array regrouping and structure splitting significantly improve data locality. The most effective technique relies on profiling every access to every data element. The high overhead impedes its adoption in a general compiler, In this paper, we show that for array regrouping in scientific programs, the overhead is not needed since the same benefit can be obtained by pure program analysis.We present an interprocedural analysis technique for array regrouping. For each global array, the analysis summarizes the access pattern by access-frequency vectors and then groups arrays with similar vectors. The analysis is context sensitive, so it tracks the exact array access. For each loop or function call, it uses two methods to estimate the frequency of the execution. The first is symbolic analysis in the compiler. The second is lightweight profiling of the code. The same interprocedural analysis is used to cumulate the overall execution frequency by considering the calling context. We implemented a prototype of both the compiler and the profiling analysis in the IBM® compiler, evaluated array regrouping on the entire set of SPEC CPU2000 FORTRAN benchmarks, and compared different analysis methods. The pure compiler-based array regrouping improves the performance for the majority of programs, leaving little room for improvement by code or data profiling.

ACM Transactions on Programming Languages and Systems | 2007

Forma : A framework for safe automatic array reshaping

Peng Zhao; Shimin Cui; Yaoqing Gao; Raul Esteban Silvera; José Nelson Amaral

This article presents Forma, a practical, safe, and automatic data reshaping framework that reorganizes arrays to improve data locality. Forma splits large aggregated data-types into smaller ones to improve data locality. Arrays of these large data types are then replaced by multiple arrays of the smaller types. These new arrays form natural data streams that have smaller memory footprints, better locality, and are more suitable for hardware stream prefetching. Forma consists of a field-sensitive alias analyzer, a data type checker, a portable structure reshaping planner, and an array reshaper. An extensive experimental study compares different data reshaping strategies in two dimensions: (1) how the data structure is split into smaller ones (maximal partition × frequency-based partition × affinity-based partition); and (2) how partitioned arrays are linked to preserve program semantics (address arithmetic-based reshaping × pointer-based reshaping). This study exposes important characteristics of array reshaping. First, a practical data reshaper needs not only an inter-procedural analysis but also a data-type checker to make sure that array reshaping is safe. Second, the performance improvement due to array reshaping can be dramatic: standard benchmarks can run up to 2.1 times faster after array reshaping. Array reshaping may also result in some performance degradation for certain benchmarks. An extensive micro-architecture-level performance study identifies the causes for this degradation. Third, the seemingly naive maximal partition achieves best or close-to-best performance in the benchmarks studied. This article presents an analysis that explains this surprising result. Finally, address-arithmetic-based reshaping always performs better than its pointer-based counterpart.

languages and compilers for parallel computing | 2008

P-OPT: Program-Directed Optimal Cache Management

Xiaoming Gu; Tongxin Bai; Yaoqing Gao; Chengliang Zhang; Roch Georges Archambault; Chen Ding

As the amount of on-chip cache increases as a result of Moores law, cache utilization is increasingly important as the number of processor cores multiply and the contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and being able to communicate this information to hardware. The paper addresses the communication problem with two new optimal algorithms for Program-directed OPTimal cache management (P-OPT) , in which a program designates certain accesses as bypasses and trespasses through an extended hardware interface to effect optimal cache utilization. The paper proves the optimality of the new methods, examines their theoretical properties, and shows the potential benefit using a simulation study and a simple test on a multi-core, multi-processor PC.

conference on object-oriented programming systems, languages, and applications | 2014

Space-efficient multi-versioning for input-adaptive feedback-driven program optimizations

Mingzhou Zhou; Xipeng Shen; Yaoqing Gao; Graham Yiu

Function versioning is an approach to addressing input-sensitivity of program optimizations. A major side effect of it is notable code size increase, which has been hindering its broad applications to large code bases and space-stringent environments. In this paper, we initiate a systematic exploration into the problem, providing answers to some fundamental questions: Given a space constraint, to which function we should apply versioning? How many versions of a function should we include in the final executable? Is the optimal selection feasible to do in polynomial time? This study proves selecting the best set of versions under a space constraint is NP-complete and proposes a heuristic algorithm named CHoGS which yields near optimal results in quadratic time. We implement the algorithm and conduct experiments through the IBM XL compilers. We observe significant performance enhancement with only slight code size increase; the results from CHoGS show factors of higher space efficiency than those from traditional hotness-based methods.

european conference on object oriented programming | 2013

Simple profile rectifications go a long way

Bo Wu; Mingzhou Zhou; Xipeng Shen; Yaoqing Gao; Raul Esteban Silvera; Graham Yiu

Feedback-driven program optimization (FDO) is common in modern compilers, including Just-In-Time compilers increasingly adopted for object-oriented or scripting languages. This paper describes a systematic study in understanding and alleviating the effects of sampling errors on the usefulness of the obtained profiles for FDO. Taking a statistical approach, it offers a series of counter-intuitive findings, and identifies two kinds of profile errors that affect FDO critically, namely zero-count errors and inconsistency errors. It further proposes statistical profile rectification, a simple approach to correcting profiling errors by leveraging statistical patterns in a profile. Experiments show that the simple approach enhances the effectiveness of sampled profile-based FDO dramatically, increasing the average FDO speedup from 1.16X to 1.3X, around 92% of what full profiles can yield.

cluster computing and the grid | 2012

Delta Send-Recv for Dynamic Pipelining in MPI Programs

Bin Bao; Chen Ding; Yaoqing Gao; Roch Georges Archambault

Pipelining is necessary for efficient do-across parallelism but the use is difficult to automate because it requires send-receive analysis and loop blocking in both sender and receiver code. The blocking factor is statically chosen. This paper presents a new interface called delta send-recv. Through compiler and run-time support, it enables dynamic pipelining. In program code, the interface is used to mark the related computation and communication. There is no need to restructure the computation code or compose multiple messages. At run time, the message size is dynamically determined, and multiple pipelines are chained among all tasks that participate in the delta communication. The new system is tested on kernel and reduced NAS benchmarks to show that it simplifies message-passing programming and improves program performance.

languages and compilers for parallel computing | 2010

Array regrouping on CMP with non-uniform cache sharing

Yunlian Jiang; Eddy Z. Zhang; Xipeng Shen; Yaoqing Gao; Roch Georges Archambault

Array regrouping enhances program spatial locality by interleaving elements of multiple arrays that tend to be accessed closely. Its effectiveness has been systematically studied for sequential programs running on unicore processors, but not for multithreading programs on modern ChipMultiprocessor (CMP) machines. On one hand, the processor-level parallelism on CMP intensifies memory bandwidth pressure, suggesting the potential benefits of array regrouping for CMP computing. On the other hand, CMP architectures exhibit extra complexities-- especially the hierarchical, heterogeneous cache sharing among hyperthreads, cores, and processors--that impose new challenges to array regrouping. In this work, we initiate an exploration to the new opportunities and challenges. We propose cache-sharing-aware reference affinity analysis for identifying data affinity in multithreading applications. The analysis consists of affinity-guided thread scheduling and hierarchical reference-vector merging, handles cache sharing among both hyperthreads and cores, and offers hints for array regrouping and the avoidance of false sharing. Preliminary experiments demonstrate the potential of the techniques in improving locality of multithreading applications on CMP with various pitfalls avoided.

Explore More