Jean-Francois Collard
Hewlett-Packard
Publications
Featured research published by Jean-Francois Collard.
International Symposium on Microarchitecture | 2006
Jack Sampson; Ruben Gonzalez; Jean-Francois Collard; Norman P. Jouppi; Michael S. Schlansker; Brad Calder
We examine the ability of CMPs, due to their lower on-chip communication latencies, to exploit data parallelism at inner-loop granularities similar to those commonly targeted by vector machines. Parallelizing code in this manner leads to a high frequency of barriers, and we explore the impact of different barrier mechanisms upon the efficiency of this approach. To further exploit the potential of CMPs for fine-grained data-parallel tasks, we present barrier filters, a mechanism for fast barrier synchronization on chip multi-processors that enables vector computations to be efficiently distributed across the cores of a CMP. We ensure that all threads arriving at a barrier require an unavailable cache line to proceed, and, by placing additional hardware in the shared portions of the memory subsystem, we starve their requests until all threads have arrived. Specifically, our approach uses invalidation requests both to make cache lines unavailable and to identify when a thread has reached the barrier. We examine two types of barrier filters, one synchronizing through instruction cache lines, and the other through data cache lines.
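The starve-until-all-arrive behavior described in the abstract can be illustrated with a rough software analogy. This is a hypothetical sketch of the idea only (the paper's mechanism lives in hardware and works through cache-line invalidations, not language-level primitives): each arriving thread registers its arrival, analogous to issuing an invalidation request, and then blocks as if its cache-line fetch were being withheld, until the last thread arrives and all pending "requests" are satisfied at once.

```python
import threading

class FilterBarrier:
    """Software analogy of a barrier filter: an arriving thread marks
    its arrival (as an invalidation request would) and then blocks on a
    'fetch' that is withheld until every thread has arrived."""
    def __init__(self, n_threads):
        self.n = n_threads
        self.arrived = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            self.arrived += 1              # arrival noticed by the 'filter'
            if self.arrived == self.n:     # last arrival: stop starving
                self.arrived = 0
                self.cond.notify_all()     # satisfy all pending requests
            else:
                self.cond.wait()           # request starved until all arrive

# usage: threads of a fine-grained data-parallel task meet at the barrier
results = []
lock = threading.Lock()
barrier = FilterBarrier(4)

def worker(tid):
    barrier.wait()                         # no thread passes early
    with lock:
        results.append(tid)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The point of the analogy is that no thread can make progress past the barrier until the release is granted collectively, which is what makes the hardware version cheap at the high barrier frequencies that inner-loop parallelization produces.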
ACM SIGARCH Computer Architecture News | 2005
Jack Sampson; Ruben Gonzalez; Jean-Francois Collard; Norman P. Jouppi; Michael S. Schlansker
This paper presents a novel mechanism for barrier synchronization on chip multi-processors (CMPs). By forcing the invalidation of selected I-cache lines, this mechanism starves threads and thus forces their execution to stop. Threads are released when all have entered the barrier. We evaluated this mechanism using SMTSim and report much better (and, most importantly, flatter) performance than lock-based barriers supported by existing microprocessors.
International Conference on Supercomputing | 2005
Gary Gostin; Jean-Francois Collard; Kirby L. Collins
This paper offers an overview of the HP Superdome shared memory multiprocessor, along with a detailed description of the cache coherence implementation. (An early and limited description was provided in [9].) We focus in particular on the sx1000 chipset, codenamed Pinnacles, which is used in HP's Integrity line of servers (Superdome Integrity, rx8620, rx7620) and in HP's 9000 series of PA-RISC based products (Superdome, rp8620, rp7420). The design goals for this architecture were to provide a platform that supported both PA-RISC and Itanium family processors, support multiple operating systems including HP-UX, Windows, and Linux, provide cache coherent scalability to large ways of MP, and support multiple product generations while preserving customer investments in memory and I/O infrastructure. This paper covers the system organization and network topology (Section 2), details on how processor instructions appear as coherence transactions (Section 3), the cache coherence protocol (Section 4), and microarchitectural details on the chipset (Section 5). It concludes with examples of application benchmarks that demonstrate the scalability achieved (Section 6).
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2005
Jean-Francois Collard; Norman P. Jouppi; Sami Yehia
Inspired by recent advances in microprocessor performance monitors, this paper shows how a shared-memory multiprocessor chipset and interconnect can be equipped with performance monitors that associate performance events with the PCs of the individual instructions causing these events. Such monitors greatly simplify performance debugging of shared-memory programs: for example, they make finding pairs of instructions in false sharing straightforward. These monitors also enable precise feedback-directed compiler optimizations and, as a second contribution, we show how they can guide the code generator to use the version of the load instruction that makes the best use of the coherence protocol. Experiments show up to almost 10% coherence traffic reduction on SPLASH2 applications.
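To see why PC-tagged coherence events make false-sharing detection straightforward, consider the post-processing such a monitor enables. This is a hypothetical analysis sketch, not the chipset's actual interface: given a trace of (PC, address) pairs for coherence events, pairs of instructions that touch the same cache line at different offsets exhibit the classic false-sharing signature.

```python
from collections import defaultdict
from itertools import combinations

LINE_SIZE = 64  # assumed cache-line size in bytes

def find_false_sharing(events):
    """From PC-tagged coherence events (pc, address), report pairs of
    distinct PCs that access the same cache line at different offsets,
    i.e. candidates for false sharing."""
    by_line = defaultdict(set)
    for pc, addr in events:
        # group each access by cache line, remembering PC and offset
        by_line[addr // LINE_SIZE].add((pc, addr % LINE_SIZE))
    pairs = set()
    for accesses in by_line.values():
        for (pc1, off1), (pc2, off2) in combinations(sorted(accesses), 2):
            if pc1 != pc2 and off1 != off2:  # same line, distinct data
                pairs.add((pc1, pc2))
    return sorted(pairs)

# example trace: two instructions repeatedly hitting different
# words of the same cache line (hypothetical PCs and addresses)
events = [(0x400a10, 0), (0x400b20, 8), (0x400a10, 0), (0x400b20, 8)]
for pc1, pc2 in find_false_sharing(events):
    print(hex(pc1), hex(pc2))  # prints 0x400a10 0x400b20
```

Without the PC association, a monitor could only report that a line was contended; attributing the contention to a specific pair of instructions is what makes the fix (padding or restructuring the shared data) immediate.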
Archive | 2004
Alan H. Karp; Jean-Francois Collard
Archive | 2005
Jean-Francois Collard; Norman P. Jouppi; Michael S. Schlansker
Archive | 2004
Alan H. Karp; Jean-Francois Collard
Archive | 2004
Jean-Francois Collard; Patrick Estep
Archive | 2006
Michael S. Schlansker; Jean-Francois Collard; Rajendra Kumar
Archive | 2006
Michael S. Schlansker; Erwin Oertli; Jean-Francois Collard