Publications


Featured research published by Roch Georges Archambault.


international symposium on memory management | 2008

MPADS: memory-pooling-assisted data splitting

Stephen Curial; Peng Zhao; José Nelson Amaral; Yaoqing Gao; Shimin Cui; Raul Esteban Silvera; Roch Georges Archambault

This paper describes Memory-Pooling-Assisted Data Splitting (MPADS), a framework that combines data structure splitting with memory pooling. Although MPADS may call to mind memory padding, a distinction of this framework is that it does not insert padding. MPADS relies on pointer analysis to ensure that splitting is safe and applicable to type-unsafe languages; it makes no assumption about type safety. The analysis identifies cases in which the transformation could lead to incorrect code, and MPADS abandons those cases. To make data structure splitting efficient in a commercial compiler, MPADS is designed with great attention to reducing the number of instructions required to access the data after the data-structure splitting. Moreover, the implementation of MPADS reveals that architectural details should be considered carefully when rearranging data allocation. For instance, one of the most significant gains from the introduction of data-structure splitting in code targeting the IBM POWER architecture is a dramatic decrease in the amount of data prefetched by the hardware prefetch engine, without a noticeable decrease in cache utilization. Triggering fewer hardware prefetch streams frees memory bandwidth and cache space. Fewer prefetching streams also reduce the interference between the data accessed by multiple cores in modern multicore processors.
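
The transformation is easiest to see in code. The following is a minimal C sketch of what splitting with pooled allocation might look like after the fact; the struct and field names are illustrative, not taken from the paper, and the real transformation is performed automatically by the compiler based on pointer-analysis evidence.

```c
#include <stdlib.h>

/* Original layout: hot and cold fields interleaved in one allocation. */
struct node_orig {
    int    key;      /* hot: touched on every traversal */
    double payload;  /* cold: touched only on a match   */
};

/* After splitting: each field lives in its own memory pool, so a
 * traversal over keys streams through densely packed cache lines
 * and triggers a single hardware prefetch stream. */
struct node_pools {
    int    *keys;      /* pool of hot fields  */
    double *payloads;  /* pool of cold fields */
};

/* Pool setup: one allocation per field instead of one per object,
 * mirroring what the transformed code would do. Note that no
 * padding is inserted between elements. */
static struct node_pools pools_create(size_t n)
{
    struct node_pools p;
    p.keys     = malloc(n * sizeof *p.keys);
    p.payloads = malloc(n * sizeof *p.payloads);
    return p;
}

/* Field accesses are rewritten from node[i].key to pools.keys[i];
 * the index i is the only extra state the split code must carry. */
static double lookup(struct node_pools p, size_t n, int k)
{
    for (size_t i = 0; i < n; i++)
        if (p.keys[i] == k)          /* scans only the key pool  */
            return p.payloads[i];    /* touches cold pool once   */
    return -1.0;                     /* sentinel: key not found  */
}
```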


international conference on supercomputing | 2005

Lightweight reference affinity analysis

Xipeng Shen; Yaoqing Gao; Chen Ding; Roch Georges Archambault

Previous studies have shown that array regrouping and structure splitting significantly improve data locality. The most effective technique relies on profiling every access to every data element, and the high overhead impedes its adoption in a general compiler. In this paper, we show that for array regrouping in scientific programs, this overhead is not needed, since the same benefit can be obtained by pure program analysis. We present an interprocedural analysis technique for array regrouping. For each global array, the analysis summarizes the access pattern by access-frequency vectors and then groups arrays with similar vectors. The analysis is context sensitive, so it tracks the exact array accesses. For each loop or function call, it uses two methods to estimate the frequency of execution: the first is symbolic analysis in the compiler; the second is lightweight profiling of the code. The same interprocedural analysis is used to accumulate the overall execution frequency by considering the calling context. We implemented a prototype of both the compiler and the profiling analysis in the IBM® compiler, evaluated array regrouping on the entire set of SPEC CPU2000 FORTRAN benchmarks, and compared the different analysis methods. Pure compiler-based array regrouping improves the performance of the majority of the programs, leaving little room for improvement by code or data profiling.
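
As a rough illustration of grouping by access-frequency vectors, the C sketch below scores vector similarity with a cosine measure. The metric, the threshold, and the example data are assumptions for illustration; the abstract states only that arrays with similar vectors are grouped.

```c
#include <math.h>
#include <stdio.h>

#define NSITES 4   /* loops / call sites that were profiled or estimated */

/* Access-frequency vector: estimated accesses to one array at each
 * site, obtained by symbolic analysis or lightweight profiling. */
struct afv { const char *name; double freq[NSITES]; };

/* Two arrays belong in one group when their frequency vectors are
 * nearly proportional: they are accessed together wherever they are
 * accessed at all. Cosine similarity is one way to capture this. */
static double cosine(const double *a, const double *b)
{
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < NSITES; i++) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (sqrt(na) * sqrt(nb) + 1e-12);
}

int main(void)
{
    struct afv arrays[] = {
        { "x", {100, 100,   0, 50} },
        { "y", {100, 100,   0, 50} },  /* same pattern as x: regroup    */
        { "z", {  0,   0, 900,  0} },  /* different pattern: keep apart */
    };
    for (int i = 0; i < 3; i++)
        for (int j = i + 1; j < 3; j++)
            if (cosine(arrays[i].freq, arrays[j].freq) > 0.95)
                printf("regroup %s with %s\n",
                       arrays[i].name, arrays[j].name);
    return 0;
}
```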


languages and compilers for parallel computing | 2008

P-OPT: Program-Directed Optimal Cache Management

Xiaoming Gu; Tongxin Bai; Yaoqing Gao; Chengliang Zhang; Roch Georges Archambault; Chen Ding

As the amount of on-chip cache increases as a result of Moore's law, cache utilization becomes increasingly important as the number of processor cores multiplies and contention for memory bandwidth becomes more severe. Optimal cache management requires knowing the future access sequence and being able to communicate this information to the hardware. This paper addresses the communication problem with two new optimal algorithms for Program-directed OPTimal cache management (P-OPT), in which a program designates certain accesses as bypasses and trespasses through an extended hardware interface to effect optimal cache utilization. The paper proves the optimality of the new methods, examines their theoretical properties, and shows their potential benefit using a simulation study and a simple test on a multi-core, multi-processor PC.
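
To suggest how a program might designate such accesses, here is a hedged C sketch. The intrinsic names, their no-op stubs, and the comment semantics are assumptions standing in for P-OPT's extended hardware interface, which the paper defines at the hardware level.

```c
/* Hypothetical hints standing in for an extended hardware interface;
 * real hardware would expose these as special load/store instructions
 * or cache-control attributes. Stubs keep the sketch compilable. */
static void cache_bypass_hint(const void *addr)   { (void)addr; /* no-op */ }
static void cache_trespass_hint(const void *addr) { (void)addr; /* no-op */ }

/* Streaming pass over data used exactly once: caching it would only
 * evict data that has reuse, so each access is marked as a bypass. */
double sum_once(const double *a, long n)
{
    double s = 0;
    for (long i = 0; i < n; i++) {
        cache_bypass_hint(&a[i]);   /* future reuse distance: infinite */
        s += a[i];
    }
    return s;
}

/* Final pass over a buffer: a trespass (illustrative semantics) lets
 * the access use the cache but marks the line for early eviction. */
double last_pass(const double *a, long n)
{
    double s = 0;
    for (long i = 0; i < n; i++) {
        cache_trespass_hint(&a[i]);
        s += a[i] * a[i];
    }
    return s;
}
```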


international workshop on openmp | 2004

Structure and algorithm for implementing OpenMP workshares

Guansong Zhang; Raul Esteban Silvera; Roch Georges Archambault

Although OpenMP has become the leading standard among parallel programming languages, the implementation of its runtime environment is not well discussed in the literature. In this paper, we introduce some of the key data structures required to implement OpenMP workshares in our runtime library, and we discuss how to improve its performance. This includes how to set up a workshare control block queue, how to initialize the data within a control block, how to improve barrier performance, and how to handle implicit-barrier and nowait situations. Finally, we discuss the performance of this implementation, focusing on the EPCC benchmark.
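
The C sketch below suggests what a workshare control block and its queue might look like. The field names, the fixed queue length, and the dispatch routine are assumptions for illustration, not the layout of the IBM runtime described in the paper.

```c
#include <pthread.h>

/* Illustrative workshare control block (WSCB). */
struct wscb {
    long            lower, upper, step;  /* bounds of the shared loop   */
    long            next_chunk;          /* next chunk to hand out      */
    int             arrived;             /* threads at implicit barrier */
    int             nowait;              /* skip the implicit barrier?  */
    pthread_mutex_t lock;
};

/* A small circular queue of control blocks lets threads that pass a
 * nowait workshare run ahead into the next workshare while slower
 * threads are still finishing the previous one. */
#define WSCB_QUEUE_LEN 4
struct wscb_queue {
    struct wscb blocks[WSCB_QUEUE_LEN];
    int         newest;  /* index of the most recently started workshare */
};

/* Chunk dispatch: each arriving thread grabs the next chunk under the
 * block's lock (assumes a positive step for brevity). */
long wscb_next_chunk(struct wscb *w, long chunk)
{
    pthread_mutex_lock(&w->lock);
    long c = w->next_chunk;
    w->next_chunk += chunk * w->step;
    pthread_mutex_unlock(&w->lock);
    return c < w->upper ? c : -1;   /* -1 means no work left */
}
```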


cluster computing and the grid | 2012

Delta Send-Recv for Dynamic Pipelining in MPI Programs

Bin Bao; Chen Ding; Yaoqing Gao; Roch Georges Archambault

Pipelining is necessary for efficient do-across parallelism, but its use is difficult to automate because it requires send-receive analysis and loop blocking in both the sender and the receiver code, and the blocking factor must be chosen statically. This paper presents a new interface called delta send-recv. Through compiler and run-time support, it enables dynamic pipelining. In program code, the interface is used to mark the related computation and communication; there is no need to restructure the computation code or compose multiple messages. At run time, the message size is determined dynamically, and multiple pipelines are chained among all tasks that participate in the delta communication. The new system is tested on kernel and reduced NAS benchmarks and shown to simplify message-passing programming and improve program performance.
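
To make the interface concrete, here is a hedged C sketch of a delta-send-style API over MPI. The function names are illustrative, and this fallback degenerates to a single whole-buffer MPI_Send; the real runtime would instead flush completed regions incrementally, at dynamically chosen sizes, to form a pipeline.

```c
#include <mpi.h>

/* State captured between begin/end marks (single pending send for
 * simplicity). */
static struct {
    void *buf; int count; MPI_Datatype type; int peer; MPI_Comm comm;
} ds;

/* Mark the region that the upcoming computation will fill. */
static void delta_send_begin(void *buf, int count, MPI_Datatype t,
                             int dst, MPI_Comm c)
{
    ds.buf = buf; ds.count = count; ds.type = t; ds.peer = dst; ds.comm = c;
}

/* Fallback: one send of the whole region. A delta runtime would have
 * already shipped each completed sub-region as it was produced. */
static void delta_send_end(void)
{
    MPI_Send(ds.buf, ds.count, ds.type, ds.peer, 0, ds.comm);
}

/* Sender: the computation loop needs no restructuring and no manual
 * loop blocking; the interface only brackets it. */
void producer(double *a, int n, int dst)
{
    delta_send_begin(a, n, MPI_DOUBLE, dst, MPI_COMM_WORLD);
    for (int i = 0; i < n; i++)
        a[i] = (double)i * 0.5;   /* produce data in place */
    delta_send_end();
}
```

A matching delta_recv_begin/delta_recv_end pair would bracket the consumer loop on the receiver side in the same way.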


languages and compilers for parallel computing | 2010

Array regrouping on CMP with non-uniform cache sharing

Yunlian Jiang; Eddy Z. Zhang; Xipeng Shen; Yaoqing Gao; Roch Georges Archambault

Array regrouping enhances a program's spatial locality by interleaving the elements of multiple arrays that tend to be accessed close together. Its effectiveness has been systematically studied for sequential programs running on unicore processors, but not for multithreaded programs on modern Chip Multiprocessor (CMP) machines. On one hand, the processor-level parallelism on CMP intensifies memory bandwidth pressure, suggesting potential benefits of array regrouping for CMP computing. On the other hand, CMP architectures exhibit extra complexities, especially the hierarchical, heterogeneous cache sharing among hyperthreads, cores, and processors, that impose new challenges on array regrouping. In this work, we initiate an exploration of these new opportunities and challenges. We propose cache-sharing-aware reference affinity analysis for identifying data affinity in multithreaded applications. The analysis consists of affinity-guided thread scheduling and hierarchical reference-vector merging, handles cache sharing among both hyperthreads and cores, and offers hints for array regrouping and for the avoidance of false sharing. Preliminary experiments demonstrate the potential of these techniques to improve the locality of multithreaded applications on CMP while avoiding various pitfalls.
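
A minimal sketch of hierarchical reference-vector merging in C, assuming per-thread vectors of array access counts: vectors of threads that share a cache are merged first, so affinity is judged at the level where the arrays actually contend for space. The merge order mirrors the cache-sharing hierarchy described above; the data and vector layout are illustrative.

```c
#include <stdio.h>

#define NARRAYS 3

/* Per-thread reference vector: accesses by one thread to each array. */
typedef double refvec[NARRAYS];

/* Merging is a simple element-wise sum; what matters is the order in
 * which threads are merged, following the cache-sharing hierarchy. */
static void merge(refvec out, const refvec a, const refvec b)
{
    for (int i = 0; i < NARRAYS; i++)
        out[i] = a[i] + b[i];
}

int main(void)
{
    refvec t0 = {100, 100,  0};  /* two hyperthreads on one core...     */
    refvec t1 = {  0,   0, 90};
    refvec core0, total;

    merge(core0, t0, t1);        /* ...merge first at their shared L1   */

    refvec core1 = {50, 50, 10}; /* then merge cores at the shared L3   */
    merge(total, core0, core1);

    for (int i = 0; i < NARRAYS; i++)
        printf("array %d refs at shared level: %.0f\n", i, total[i]);
    return 0;
}
```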


IBM Systems Journal | 2006

Using advanced compiler technology to exploit the performance of the Cell Broadband Engine™ architecture

Alexandre E. Eichenberger; John Kevin Patrick O'Brien; Kathryn M. O'Brien; Peng Wu; Tong Chen; P.H. Oden; D.A. Prener; Janice C. Shepherd; Byoungro So; Zehra Sura; Amy Wang; Tao Zhang; P. Zhao; Michael Karl Gschwind; Roch Georges Archambault; Y. Gao; R. Koo


Archive | 2000

Loop allocation for optimizing compilers

Roch Georges Archambault; Robert James Blainey


Archive | 2005

Method and apparatus for software-assisted data cache and prefetch control

Roch Georges Archambault; Yaoqing Gao; Francis O'Connell; Robert B. Tremaine; Michael E. Wazlowski; Steven Wayne White; Lixin Zhang


Archive | 2003

Scalable runtime system for global address space languages on shared and distributed memory machines

Roch Georges Archambault; Anthony Bolmarcich; G. Calin Cascaval; Siddhartha Chatterjee; Maria Eleftheriou; Raymond Ying Chau Mak
