Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Stéphane Zuckerman is active.

Publication


Featured research published by Stéphane Zuckerman.


Microprocessors and Microsystems | 2014

TERAFLUX: Harnessing dataflow in next generation teradevices

Roberto Giorgi; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Rahulkumar Gayatri; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Joshua Landwehr; Nhat Minh Lê; Feng Li; Mikel Luján; Avi Mendelson; Laurent Morin; Nacho Navarro; Tomasz Patejko; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Ian Watson; Sebastian Weis; Stéphane Zuckerman; Mateo Valero

The improvements in semiconductor technologies are gradually enabling extreme-scale systems such as teradevices (i.e., chips composed of 1,000 billion transistors), most likely by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses these challenges at once by leveraging dataflow principles. This paper presents an overview of the research carried out by the TERAFLUX partners and some preliminary results. Our platform comprises 1,000+ general-purpose cores per chip in order to properly explore the above challenges. An architectural template has been proposed and applications have been ported to the platform. Programming models, compilation tools, and reliability techniques have been developed. The evaluation is carried out by leveraging modifications of the HP-Labs COTSon simulator.


international conference on parallel processing | 2013

An implementation of the codelet model

Joshua Suetterlein; Stéphane Zuckerman; Guang R. Gao

Chip architectures are shifting from few, faster, functionally heavy cores to abundant, slower, simpler cores to address pressing physical limitations such as energy consumption and heat expenditure. As architectural trends continue to fluctuate, we propose a novel program execution model, the Codelet model, which is designed for new systems tasked with efficiently managing varying resources. The Codelet model is a fine-grained, dataflow-inspired model extended to address the cumbersome resources available in new architectures. In the following, we define the Codelet execution model as well as provide an implementation named DARTS. Utilizing DARTS and two predominant kernels, matrix multiplication and the Graph 500's breadth-first search, we explore the validity of fine-grain execution as a promising and viable execution model for future and current architectures. We show that our runtime is on par with or performs better than AMD's highly optimized parallel library for matrix multiplication, outperforming it on average by 1.40× with a speedup of up to 4×. Our implementation of the parallel BFS outperforms Graph 500's reference implementation (with or without dynamic scheduling) on average by 1.50× with a speedup of up to 2.38×.
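The dependency-counter mechanics at the heart of a codelet runtime can be sketched in a few lines. This is a hypothetical illustration, not the DARTS API: each `Codelet` holds a counter of unsatisfied dependencies and is enqueued only once that counter reaches zero.

```python
from collections import deque

class Codelet:
    """A non-preemptive task that fires only once all its dependencies are met."""
    def __init__(self, name, fn, num_deps):
        self.name, self.fn = name, fn
        self.remaining_deps = num_deps
        self.successors = []   # codelets signaled when this one completes

def run(ready):
    """Execute codelets in dataflow order, starting from an initial ready queue."""
    queue = deque(ready)
    order = []
    while queue:
        c = queue.popleft()
        c.fn()
        order.append(c.name)
        for succ in c.successors:
            succ.remaining_deps -= 1
            if succ.remaining_deps == 0:   # all inputs arrived: now schedulable
                queue.append(succ)
    return order

# A tiny diamond dependency graph: a -> (b, c) -> d
a = Codelet("a", lambda: None, 0)
b = Codelet("b", lambda: None, 1)
c = Codelet("c", lambda: None, 1)
d = Codelet("d", lambda: None, 2)
a.successors = [b, c]
b.successors = [d]
c.successors = [d]
order = run([a])   # d fires last, only after both b and c complete
```

A real runtime would, of course, distribute the ready queue across hardware threads; the single-threaded loop above only illustrates the firing rule.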


languages and compilers for parallel computing | 2009

A balanced approach to application performance tuning

Souad Koliai; Stéphane Zuckerman; Emmanuel Oseret; Mickaël Ivascot; Tipp Moseley; Dinh Quang; William Jalby

Current hardware trends place increasing pressure on programmers and tools to optimize scientific code. Numerous tools and techniques exist, but no single tool is a panacea; instead, different tools have different strengths. Therefore, an assortment of performance tuning utilities and strategies is necessary to best utilize scarce resources (e.g., bandwidth, functional units, cache). This paper describes a combined methodology for the optimization process. The strategy combines static assembly analysis using MAQAO with dynamic information from hardware performance monitoring (HPM) and memory traces. We introduce a new technique, decremental analysis (DECAN), to iteratively identify the individual instructions responsible for performance bottlenecks. We present case studies on applications from several independent software vendors (ISVs) on an SMP Xeon Core 2 platform. These strategies help discover problems related to memory access locality and loop unrolling that lead to a sequential performance improvement of a factor of 2.


ieee international conference on high performance computing, data, and analytics | 2008

Fine tuning matrix multiplications on multicore

Stéphane Zuckerman; Marc Pérache; William Jalby

Multicore systems are becoming ubiquitous in scientific computing. As performance libraries are adapted to such systems, the difficulty of extracting the best performance out of them is quite high. Indeed, performance libraries such as Intel's MKL, while performing very well on unicore architectures, see their behaviour degrade when used on multicore systems. Moreover, even multicore systems show wide differences among each other (presence of shared caches, memory bandwidth, etc.). We propose a systematic method to improve the parallel execution of matrix multiplication, through the study of the behavior of unicore DGEMM kernels in MKL, as well as various other criteria. We show that our fine-tuning can outperform Intel's parallel DGEMM of MKL, with performance gains sometimes up to a factor of two.
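As a rough illustration of the kind of parallel decomposition being tuned (a sketch, not MKL's actual implementation), a parallel matrix multiplication can split the rows of the result matrix across threads, each thread running a sequential kernel on its own block:

```python
from threading import Thread

def matmul_block(A, B, C, row_lo, row_hi):
    """Compute C[row_lo:row_hi] = A[row_lo:row_hi] @ B with a naive triple loop."""
    m, p = len(B), len(B[0])
    for i in range(row_lo, row_hi):
        for k in range(m):
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]

def parallel_matmul(A, B, num_threads=2):
    """Partition the rows of C into contiguous blocks, one per worker thread."""
    n, p = len(A), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    step = (n + num_threads - 1) // num_threads
    threads = [Thread(target=matmul_block, args=(A, B, C, lo, min(lo + step, n)))
               for lo in range(0, n, step)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = parallel_matmul(A, B)   # [[19.0, 22.0], [43.0, 50.0]]
```

The paper's point is that the best block shape and thread count depend on shared caches and memory bandwidth, which is exactly what this naive even split ignores.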


international conference on conceptual structures | 2014

A Dataflow Programming Language and its Compiler for Streaming Systems

Haitao Wei; Stéphane Zuckerman; Xiaoming Li; Guang R. Gao

The dataflow programming paradigm shows an important way to improve programming productivity for streaming systems. In this paper we propose COStream, a programming language based on the synchronous dataflow execution model for data-driven applications. We also propose a compiler framework for COStream on general-purpose multi-core architectures. It features an inter-thread software pipelining scheduler to exploit the parallelism among the cores. We implemented the COStream compiler framework on an x86 multi-core architecture and performed experiments to evaluate the system.
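The synchronous dataflow model underlying COStream can be sketched as actors with fixed token-consumption rates connected by FIFO channels: an actor fires only when enough tokens are buffered on its input. The `Actor` class below is an illustrative assumption, not COStream syntax.

```python
from collections import deque

class Actor:
    """A synchronous-dataflow actor: consumes a fixed number of tokens per firing."""
    def __init__(self, fn, rate_in=1):
        self.fn, self.rate_in = fn, rate_in
        self.inbox = deque()   # FIFO channel feeding this actor

    def fire(self):
        """Fire only when enough tokens are available (the SDF firing rule)."""
        if len(self.inbox) < self.rate_in:
            return None
        args = [self.inbox.popleft() for _ in range(self.rate_in)]
        return self.fn(*args)

# A two-stage pipeline: scale each token, then sum them pairwise.
scale = Actor(lambda x: 2 * x)
adder = Actor(lambda a, b: a + b, rate_in=2)

results = []
for token in [1, 2, 3, 4]:
    scale.inbox.append(token)
    adder.inbox.append(scale.fire())   # rate_in=1, so scale always fires
    s = adder.fire()                   # fires only on every second token
    if s is not None:
        results.append(s)
# results == [6, 14], i.e. (2 + 4) and (6 + 8)
```

Because the rates are fixed and known statically, a compiler can schedule such a pipeline across threads ahead of time, which is what the inter-thread software pipelining scheduler described above exploits.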


european conference on parallel processing | 2013

Toward a Self-aware System for Exascale Architectures

Aaron Landwehr; Stéphane Zuckerman; Guang R. Gao

High-performance systems are evolving to a point where performance is no longer the sole relevant criterion. The current execution and resource management paradigms are no longer sufficient to ensure correctness and performance. Power requirements are presently driving the co-design of HPC systems, which in turn sets the course for a radical change in how to express the need for scarcer and scarcer resources, as well as how to manage them. It is our opinion that systems will need to become more introspective and self-aware with respect to performance, energy, and resiliency. In this position paper, we explore the major hardware requirements we believe are central to enabling introspection and self-awareness, as well as the types of interfaces and information that will be needed for such runtime systems. We also discuss a research path toward a self-aware system for exascale architectures.


international parallel and distributed processing symposium | 2017

Multigrain Parallelism: Bridging Coarse-Grain Parallel Programs and Fine-Grain Event-Driven Multithreading

Jaime Arteaga Molina; Stéphane Zuckerman; Guang R. Gao

The overwhelming wealth of parallelism exposed by extreme-scale computing is rekindling the interest for fine-grain multithreading, particularly at the intranode level. Indeed, popular parallel programming models, such as OpenMP, are integrating fine-grain tasking in their newest standards. Yet, classical coarse-grain constructs are still largely preferred, as they are considered a simpler way to express parallelism. In this paper, we present a Multigrain Parallel Programming environment that allows programmers to use these well-known coarse-grain constructs to generate a fine-grain multithreaded application to be run on top of a fine-grain event-driven program execution model. Experimental results with four scientific benchmarks (Graph500, NAS Data Cube, NWChem-SCF, and ExMatEx's CoMD) show that fine-grain applications generated by and run on our environment are competitive with and even outperform their OpenMP counterparts, especially for data-intensive workloads with irregular and dynamic parallelism, reaching speedups as high as 2.6× for Graph500 and 50× for NAS Data Cube.


2014 Fourth Workshop on Data-Flow Execution Models for Extreme Scale Computing | 2014

A Holistic Dataflow-Inspired System Design

Stéphane Zuckerman; Haitao Wei; Guang R. Gao; Howard Wong; Jean-Luc Gaudiot; Ahmed Louri

Computer systems have undergone a fundamental transformation recently, from single-core processors to devices with increasingly higher core counts within a single chip. The semiconductor industry now faces the infamous power and utilization walls. To meet these challenges, heterogeneity in design, both at the architecture and technology levels, will be the prevailing approach for energy-efficient computing, as specialized cores, accelerators, etc., can eliminate the energy overheads of general-purpose homogeneous cores. However, with future technological challenges pointing in the direction of on-chip heterogeneity, and because of the traditional difficulty of parallel programming, it becomes imperative to produce new system software stacks that can take advantage of the heterogeneous hardware. As a case in point, the core count per chip continues to increase dramatically while the available on-chip memory per core is only getting marginally bigger. Thus, data locality, already a must-have in high-performance computing, will become even more critical as memory technology progresses. In turn, this makes it crucial that new execution models be developed to better exploit the trends of future heterogeneous computing in many-core chips. To solve these issues, we propose a cross-cutting, cross-layer approach to address the challenges posed by future heterogeneous many-core chips.


ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013

Towards Memory-Load Balanced Fast Fourier Transformations in Fine-Grain Execution Models

Chen Chen; Yao Wu; Stéphane Zuckerman; Guang R. Gao

The codelet model is a fine-grain dataflow-inspired program execution model that balances the parallelism and overhead of the runtime system. It plays an important role in terms of performance, scalability, and energy efficiency in exascale studies such as the DARPA UHPC project and the DOE X-Stack project. As an important application, the Fast Fourier Transform (FFT) has been deeply studied in fine-grain models, including the codelet model. However, the existing work focuses on how fine-grain models achieve a more balanced workload compared to traditional coarse-grain models. In this paper, we make an important observation that the flexibility of the execution order of tasks in fine-grain models improves utilization of memory bandwidth as well. We use the codelet model and the FFT application as a case study to show that a proper execution order of tasks (or codelets) can significantly reduce memory contention and thus improve performance. We propose an algorithm that provides heuristic guidance on the execution order of the codelets to reduce memory contention. We implemented our algorithm on the IBM Cyclops-64 architecture. Experimental results show that our algorithm improves performance by up to 46% compared to a state-of-the-art coarse-grain implementation of the FFT application on Cyclops-64.


languages and compilers for parallel computing | 2016

The Importance of Efficient Fine-Grain Synchronization for Many-Core Systems

Tongsheng Geng; Stéphane Zuckerman; José Monsalve; Alfredo Goldman; Sami J. Habib; Jean-Luc Gaudiot; Guang R. Gao

Current shared-memory systems can feature tens of processing elements. The old assumption that coarse-grain synchronization is enough in a shared-memory system thus becomes invalid. To efficiently take advantage of such systems, we propose to use fine-grain synchronization, with event-driven multithreading. To illustrate our point, we study a naive 5-point 2D stencil kernel. We provide several synchronization variants using our fine-grain multithreading environment, and compare them to a naive coarse-grain implementation using OpenMP. We conducted experiments on three different many-core compute nodes, with speedups ranging from 1.2× to 1.75×.
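A minimal sketch of the fine-grain, event-driven synchronization idea (the wavefront dependence pattern below is illustrative, not the paper's stencil kernel): each grid cell carries a dependency counter and becomes ready as soon as its own predecessors complete, rather than waiting at a grid-wide coarse-grain barrier.

```python
from collections import deque

# Fine-grain, dependency-counter scheduling over an N x N grid: cell (i, j)
# may only fire after its north (i-1, j) and west (i, j-1) neighbors, so
# ready cells form a diagonal wavefront instead of being released
# row-by-row behind a coarse-grain barrier.

N = 4
remaining = {(i, j): (i > 0) + (j > 0)       # north + west predecessors
             for i in range(N) for j in range(N)}

ready = deque([(0, 0)])                      # only the corner starts with no deps
done_order = []
while ready:
    i, j = ready.popleft()
    done_order.append((i, j))                # "execute" the cell here
    for ni, nj in [(i + 1, j), (i, j + 1)]:  # signal south and east successors
        if ni < N and nj < N:
            remaining[(ni, nj)] -= 1
            if remaining[(ni, nj)] == 0:     # all inputs ready: schedule it
                ready.append((ni, nj))
```

With per-cell counters, independent cells on the same anti-diagonal can run concurrently the moment they are released, which is the kind of overlap a barrier between full sweeps forbids.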

Collaboration


Dive into Stéphane Zuckerman's collaboration.

Top Co-Authors


Patrick Carribault

University of Texas at Austin
