Sylvain Huet
Centre national de la recherche scientifique
Publication
Featured research published by Sylvain Huet.
IEEE Transactions on Very Large Scale Integration Systems | 2008
Bertrand Le Gal; Emmanuel Casseau; Sylvain Huet
Multimedia applications such as video and image processing are often characterized by a huge number of data accesses. In many digital signal processing applications, array access patterns are regular and periodic. In these cases, optimized architectures using pipelined memory access controllers can be generated. In this paper, we focus on implementing memory interfacing modules that can be automatically generated by a high-level synthesis tool and that can efficiently handle predictable address patterns as well as random ones (i.e., dynamic address computations). The benefits of moving dynamic address computations from the datapath to dedicated computation units in the memory controller are also analyzed, as are operator bitwidth optimization and data locality, to save power and reduce latency.
international conference on computational science and its applications | 2008
Linda Kaouane; Dominique Houzet; Sylvain Huet
High-performance computing with low-cost machines has become a reality. As an example, the Sony PlayStation 3 gaming console offers performance of up to 150 GFLOPS for a machine's retail price of $400. Unfortunately, high performance is achieved only when the programmer exploits the architectural specificities of its Cell processor: the programmer has to focus on inter-processor communications, task allocation among the processors, task scheduling, external memory prefetching, and synchronization. In this paper, we propose and evaluate a compile flow that automates the transformation of a program expressed in the high-level system design language SystemC, used as a programming model, into its implementation on the Cell processor. SystemC constructs and the scheduler are directly mapped to the Cell API, preserving their semantics. Inter-processor and external memory communications are abstracted by means of SystemC channels. We illustrate the approach on two case studies implemented on a Sony PlayStation 3.
Journal of Real-time Image Processing | 2014
Vincent Boulos; Sylvain Huet; Vincent Fristot; Luc Salvo; Dominique Houzet
Nowadays, it is possible to build a multi-GPU supercomputer, well suited to the implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communications and task synchronization. In this paper, we propose a high-level programming model based on a data flow graph (DFG) that allows an efficient implementation of digital signal processing applications on a multi-GPU computer cluster. This DFG-based design flow abstracts the underlying architecture. We focus particularly on the efficient implementation of communications by automating computation–communication overlap, which can lead to significant speedups, as shown in the presented benchmark. The approach is validated on three experiments: a multi-host multi-GPU benchmark, a 3D granulometry application developed for research on materials, and an application for computing visual saliency maps.
conference on design and architectures for signal and image processing | 2011
Sylvain Huet; Vincent Boulos; Vincent Fristot; Luc Salvo
Nowadays, it is possible to build a multi-GPU supercomputer, well suited to the implementation of digital signal processing algorithms, for a few thousand dollars. However, to achieve the highest performance with this kind of architecture, the programmer has to focus on inter-processor communications, task synchronization … In this paper, we propose a design flow that allows an efficient implementation of a Digital Signal Processing (DSP) application, specified as a Data Flow Graph (DFG), on a multi-GPU computer cluster. We focus particularly on the effective implementation of communications by automating the computation-communication overlap, which can lead to significant speedups, as shown in the presented benchmark. The approach is validated on a 3D granulometry application developed for research on materials.
international conference on conceptual structures | 2010
Dominique Houzet; Sylvain Huet; Anis Rahman
High-performance computing with low-cost machines has become a reality with GPUs. Unfortunately, high performance is achieved only when the programmer exploits the architectural specificities of the GPU processors: the programmer has to focus on inter-GPU communications, task allocation among the GPUs, task scheduling, external memory prefetching, and synchronization. In this paper, we propose and evaluate a compile flow that automates the transformation of a program expressed in the high-level system design language SystemC into its implementation on a multi-GPU cluster. SystemC constructs and the scheduler are directly mapped to the GPU API, preserving their semantics. Inter-GPU communications are abstracted by means of SystemC channels.
digital systems design | 2010
Yun Jie Wu; Dominique Houzet; Sylvain Huet
The ever-increasing density of integration makes the NoC a relevant communication design paradigm, even for FPGAs. But NoCs are always designed without consideration of applications and programming models, as busses and crossbars are. Dealing with parallelism is still challenging. One way is to take the parallel programming model and the application field into account in the design of the NoC, to reduce the semantic gap between application and implementation. In this paper, we present a NoC and a design flow that target the implementation of streaming applications, e.g., image and video processing. The NoC topology is described as a matrix of routers (possibly a sparse matrix) mapped onto a matrix of FPGAs for prototyping, which adds a hierarchical dimension. In addition, the NoC has been developed in conjunction with a streaming programming model expressed in a subset of the SystemC language. This allows the NoC to be optimized by implementing the mechanisms of the programming model's communication and synchronization primitives directly in hardware: such a router connected to 4 processing elements occupies about 2000 CLBs of a Xilinx FPGA, which is comparable to the size of a single processor. The design flow automates the implementation of an application expressed in a SystemC subset on a NoC-based architecture.
rapid system prototyping | 2005
Sylvain Huet; Emmanuel Casseau; Olivier Pasquier
Embedded signal processing systems are usually associated with real-time constraints and/or high data rates, so that fully software implementations are often not satisfactory. In that case, mixed hardware/software implementations have to be investigated. However, the increasing complexity of current applications makes classical design processes time-consuming and consequently incompatible with an acceptable time to prototype. To address this problem, we propose a system-level design methodology that aims at unifying the design flow from the functional description to the physical HW/SW implementation through functional and architectural flexibility. Our approach consists in automatically refining high-abstraction-level models through the use of an electronic system-level tool. We illustrate our methodology with the design of a wireless communication system.
Concurrency and Computation: Practice and Experience | 2016
Farouk Mansouri; Sylvain Huet; Dominique Houzet
Nowadays, computing hardware continues to move toward more parallelism and more heterogeneity to obtain more computing power. From personal computers to supercomputers, we can find several levels of parallelism expressed by the interconnections of multi-core and many-core accelerators. On the other hand, computing software needs to adapt to this trend, and programmers can use parallel programming models (PPMs) to fulfil this difficult task. There are different PPMs available that are based on tasks, directives, or low-level languages or libraries. These offer higher or lower levels of abstraction from the architecture, each with its own syntax. However, to offer an efficient PPM with a higher level of abstraction while preserving performance, one idea is to restrict it to a specific domain and to adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal-processing applications. It is based on data-flow graph models of computation and a dynamic run-time model of execution (StarPU). We show how the user can easily express a digital signal-processing application and can take advantage of task, data, and graph parallelism in the implementation to enhance performance on targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU and Xeon Phi).
digital systems design | 2006
Sylvain Huet; Emmanuel Casseau; Olivier Pasquier
The most popular formulation of Moore's law, which states that the number of transistors on integrated circuits doubles every 18 months, is said to hold for at least another two decades. According to this prediction, if we want to take advantage of technological evolutions, designers' productivity has to increase in the same proportion. To take up this challenge, system-level design solutions have been set up, but much effort is still needed on system modelling and synthesis. In this paper, we propose a computation core synthesis methodology that can be integrated into the communication refinement steps of electronic system-level design tools. In the proposed approach, computation cores used for digital signal processing application specifications relying on coarse-grain communications and synchronizations (e.g., matrix) can be refined into computation cores that can handle fine-grain communications and synchronizations (e.g., scalar). Its originality is its ability to synthesize computation cores that can handle fine-grain data consumptions and productions that respect the intrinsic partial orders of the algorithms while preserving their original functionalities. Such cores can be used to model fine-grain input/output overlapping or iteration pipelining. Our flow is based on the analysis of a fine-grain signal flow graph used to extract fine-grain synchronizations and algorithmic expressions
international joint conference on computer vision imaging and computer graphics theory and applications | 2018
Aziliz Guezou-Philippe; Sylvain Huet; Denis Pellerin; Christian Graff