Thierry Lepley
STMicroelectronics
Publication
Featured research published by Thierry Lepley.
design automation conference | 2012
Diego Melpignano; Luca Benini; Eric Flamand; Bruno Jego; Thierry Lepley; Germain Haugou; Fabien Clermidy; Denis Dutoit
P2012 is an area- and power-efficient many-core computing accelerator based on multiple globally asynchronous, locally synchronous processor clusters. Each cluster features up to 16 processors with independent instruction streams sharing a multi-banked one-cycle access L1 data memory, a multi-channel DMA engine and specialized hardware for synchronization and aggressive power management. P2012 is 3D stacking ready and can be customized to achieve extreme area and energy efficiency by adding domain-specific HW IPs to the cluster. The first P2012 SoC prototype in 28 nm CMOS will sample in Q3, featuring four 16-processor clusters, a 1 MB L2 memory and delivering 80 GOPS (with 32-bit single-precision floating-point support) in 18 mm² with 2 W power consumption (worst-case). P2012 can run standard OpenCL™ and proprietary Native Programming Model SW components to achieve the highest level of control on application-to-resource mapping. A dedicated version of the OpenCV vision library is provided in the P2012 SW Development Kit to enable visual analytics acceleration. This paper will discuss preliminary performance measurements of common feature extraction and tracking algorithms, parallelized on P2012, versus sequential execution on ARM CPUs.
high performance embedded architectures and compilers | 2012
Selma Saidi; Pranav Tendulkar; Thierry Lepley; Oded Maler
In this paper we investigate a general approach to automate some deployment decisions for a certain class of applications on multi-core computers. We consider data-parallelizable programs that use the well-known double buffering technique to bring the data from the off-chip slow memory to the local memory of the cores via a DMA (direct memory access) mechanism. Based on the computation time and size of elementary data items as well as DMA characteristics, we derive optimal and near optimal values for the number of blocks that should be clustered in a single DMA command. We then extend the results to the case where a computation for one data item needs some data in its neighborhood. In this setting we characterize the performance of several alternative mechanisms for data sharing. Our models are validated experimentally using a cycle-accurate simulator of the Cell Broadband Engine architecture.
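The granularity trade-off described in this abstract can be illustrated with a minimal sketch. It is not the paper's model: the linear DMA cost (a fixed initiation cost plus a per-byte cost), the pipeline formula, and all names and parameters below are assumptions chosen for illustration. Write-back of results and a shorter final transfer are ignored.

```python
import math


def dma_time(s, init_cost, cost_per_byte, block_bytes):
    """Assumed DMA cost for one command carrying s blocks: fixed setup + linear part."""
    return init_cost + cost_per_byte * block_bytes * s


def pipeline_time(n_blocks, s, compute_per_block, init_cost, cost_per_byte, block_bytes):
    """Estimated total time when s blocks are clustered per DMA command.

    Double buffering overlaps the transfer of the next super-block with the
    computation on the current one, so each steady-state step costs
    max(transfer, compute). Every super-block is treated as full size.
    """
    transfers = math.ceil(n_blocks / s)
    t = dma_time(s, init_cost, cost_per_byte, block_bytes)
    c = compute_per_block * s
    # Prologue (first fetch) + overlapped steps + epilogue (last compute).
    return t + (transfers - 1) * max(t, c) + c


def best_granularity(n_blocks, local_mem_bytes, compute_per_block,
                     init_cost, cost_per_byte, block_bytes):
    """Exhaustively pick the s that minimizes the estimate above.

    Two buffers of s blocks each must fit in local memory (assumption).
    """
    max_s = min(n_blocks, local_mem_bytes // (2 * block_bytes))
    if max_s < 1:
        raise ValueError("local memory cannot hold two single-block buffers")
    return min(
        range(1, max_s + 1),
        key=lambda s: pipeline_time(n_blocks, s, compute_per_block,
                                    init_cost, cost_per_byte, block_bytes),
    )
```

Under this model, a large fixed initiation cost pushes the best choice toward large transfers, while a compute-bound workload favors small ones because they shorten the non-overlapped prologue and epilogue.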
compilers architecture and synthesis for embedded systems | 2013
Thierry Lepley; Pierre G. Paulin; Eric Flamand
Explicitly managed memory many-cores (EMM) have been a part of the industrial landscape for the last decade. The IBM CELL processor, general-purpose graphics processing units (GP-GPU) and the STHORM embedded many-core of STMicroelectronics are representative examples. This class of architecture is expected to scale well and to deliver good performance per watt and per mm² of silicon. As such, it is appealing for application problems with regular data access patterns. However, this moves significant complexity to the programmer, who must master parallelization and data movement. High-level programming tools are therefore essential to make effective programming of EMM many-cores accessible to a wide class of programmers. This paper presents a novel approach designed to simplify the programming of EMM many-core architectures. It initially addresses the image processing application domain and has been targeted to the STHORM platform. It takes a high-level description of the computation kernel algorithm and generates an OpenCL kernel optimized for the target architecture, while managing the parallelization and data movements across the hierarchy in a transparent fashion. The goal is to provide both high productivity and high performance without requiring parallel computing expertise from the programmer or application code specialization for the target architecture.
languages, compilers, and tools for embedded systems | 2004
Jean-Marc Daveau; Thomas Thery; Thierry Lepley; Miguel Santana
This paper describes the FlexCC2 register allocation framework. FlexCC2 is an optimizing retargetable C compiler for embedded processors, and in particular for DSP processors. Embedded processors often contain features such as irregular and constrained register sets that complicate register allocation, making traditional methods inefficient. In this paper, we present a register allocation framework tailored to the specificities of embedded processors. This framework has been integrated in the FlexCC2 production compiler and is used by FlexCC2 customers.
design, automation, and test in europe | 2005
J. Combaz; J.-C. Fernandez; Thierry Lepley; Joseph Sifakis
We propose a method for fine-grain QoS control of dataflow applications. We assume that the application software is described as the composition of actions (C-functions) with quality level parameters. The method computes a QoS controller from this description, together with the average execution times, worst-case execution times and deadlines of its actions. The controller dynamically computes feasible schedules and quality assignments for these actions. Furthermore, the control policy ensures optimal time budget utilization. A prototype tool implementing the method is shown, as well as experimental results for a nontrivial example. The results show the interest of fine-grain QoS control for video encoders.
embedded software | 2005
Jacques Combaz; Jean-Claude Fernandez; Thierry Lepley; Joseph Sifakis
We propose a method for fine-grain QoS control of real-time applications. The method adapts the overall system behavior by adequately setting the quality level parameters of its actions. The objective of the control policy is to meet QoS requirements including three types of properties: 1) safety, that is, no deadline is missed; 2) optimality, that is, maximal utilization of the available time budget; 3) smoothness of quality levels. The method takes as input a model of the application software, QoS requirements and platform-dependent timing information, and produces a controlled application software meeting the QoS requirements on the target platform. This paper provides a complete formalization of the quality control problem. It proposes a new control management policy ensuring safety, near-optimality and smoothness. It also describes a prototype tool implementing the quality control algorithm and experimental results about its application to a video encoder.
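The safety property in this abstract can be illustrated with a simple greedy controller. This is a hypothetical sketch, not the policy proposed in the paper: it only uses worst-case execution times (WCET), it ignores smoothness, and the function names and data layout are assumptions. Before each action, it picks the highest quality level whose WCET still leaves enough time to run all remaining actions at their cheapest level.

```python
def choose_quality(elapsed, deadline, wcet_current, wcet_min_remaining):
    """Pick the highest quality level that cannot cause a deadline miss.

    wcet_current: dict mapping quality level -> WCET of the next action.
    wcet_min_remaining: sum of the lowest-quality WCETs of all later actions.
    Falls back to the lowest level if no level is provably safe.
    """
    feasible = [
        q for q, wcet in wcet_current.items()
        if elapsed + wcet + wcet_min_remaining <= deadline
    ]
    return max(feasible) if feasible else min(wcet_current)


def run_cycle(actions, deadline, actual_time):
    """Run one cycle of actions, re-deciding the quality before each one.

    actions: list of dicts (quality level -> WCET), one per action.
    actual_time: callable (action index, quality) -> observed duration,
                 assumed never to exceed the corresponding WCET.
    Returns the chosen quality levels and the total elapsed time.
    """
    elapsed = 0
    chosen = []
    for i, wcet in enumerate(actions):
        tail = sum(min(a.values()) for a in actions[i + 1:])
        q = choose_quality(elapsed, deadline, wcet, tail)
        chosen.append(q)
        elapsed += actual_time(i, q)
    return chosen, elapsed
```

Because the decision is repeated before every action using the time actually elapsed, slack left by actions that finish early is reclaimed by later ones, which then run at a higher quality level.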
embedded software | 2002
Valérie Bertin; Jean-Marc Daveau; Philippe Guillaume; Thierry Lepley; Denis Pilat; Claire Richard; Miguel Santana; Thomas Thery
The design of efficient compilers for embedded processors has emerged with the growing importance of embedded application-specific processors and DSPs in consumer, multimedia and communication applications. We present in this paper the FlexCC2 compiler. FlexCC2 is a retargetable compiler for embedded processors, part of the FlexWare embedded software development environment. Application-specific processors often contain dedicated features, such as specialized instructions, that traditional compilers can hardly accommodate. In this context, compilers that produce high-quality code, both in size and performance, while being easily retargetable and able to use processor-specific instructions represent a particular competitive differentiation. FlexCC2 offers such a differentiation to its users.
design, automation, and test in europe | 2013
Edoardo Paone; N. Vahabi; Vittorio Zaccaria; Cristina Silvano; Diego Melpignano; Germain Haugou; Thierry Lepley
In this paper, we introduce a novel modeling technique to reduce the time associated with cycle-accurate simulation of parallel applications deployed on many-core embedded platforms. We introduce an ensemble model based on artificial neural networks that exploits (in the training phase) multiple levels of simulation abstraction, from cycle-accurate to cycle-approximate, to predict the cycle-accurate results for unknown application configurations. We show that high-level modeling can be used to significantly reduce the number of low-level model evaluations provided that a suitable artificial neural network is used to aggregate the results. We propose a methodology for the design and optimization of such an ensemble model and we assess the proposed approach for an industrial simulation framework based on STMicroelectronics STHORM (P2012) many-core computing fabric.
international conference on hardware/software codesign and system synthesis | 2012
Edoardo Paone; Gianluca Palermo; Vittorio Zaccaria; Cristina Silvano; Diego Melpignano; Germain Haugou; Thierry Lepley
Open Computing Language (OpenCL) is emerging as a standard for parallel programming of heterogeneous hardware accelerators. With respect to device-specific languages, OpenCL enables application portability but does not guarantee performance portability, possibly requiring additional tuning of the implementation to a specific platform or to unpredictable dynamic workloads. In this paper, we present a methodology to analyze the customization space of an OpenCL application in order to improve performance portability and to support dynamic adaptation. We formulate our case study by implementing an OpenCL image stereo-matching application (which computes the relative depth of objects from a pair of stereo images) customized to the STMicroelectronics Platform 2012 many-core computing fabric. In particular, we use design space exploration techniques to generate a set of operating points that represent specific configurations of the parameters allowing different trade-offs between performance and accuracy of the algorithm itself. These points give detailed knowledge about the interaction between the application parameters, the underlying architecture and the performance of the system; they could also be used by a run-time manager software layer to meet dynamic Quality-of-Service (QoS) constraints. To analyze the customization space, we use cycle-accurate simulations for the target architecture. Since the profiling phase of each configuration takes a long simulation time, we designed our methodology to reduce the overall number of simulations by exploiting some important features of the application parameters; our analysis also enables the identification of the parameters that could be explored on a high-level simulation model to reduce the simulation time. The resulting methodology is one order of magnitude more efficient than an exhaustive exploration and, given its randomized nature, it increases the probability of avoiding sub-optimal trade-offs.
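The notion of operating points with different performance/accuracy trade-offs can be sketched as a Pareto filter followed by a run-time selection step. This is an illustrative sketch only: the `OperatingPoint` fields, the two-objective setup (execution time and error, both minimized) and the deadline-based selection rule are assumptions, not the paper's exploration methodology.

```python
from typing import NamedTuple, Optional


class OperatingPoint(NamedTuple):
    config: dict       # hypothetical application parameters
    exec_time: float   # measured or simulated execution time
    error: float       # accuracy loss of the algorithm (lower is better)


def dominates(a: OperatingPoint, b: OperatingPoint) -> bool:
    """True if a is at least as good as b on both metrics and better on one."""
    return (a.exec_time <= b.exec_time and a.error <= b.error
            and (a.exec_time < b.exec_time or a.error < b.error))


def pareto_front(points: list) -> list:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def pick_for_deadline(front: list, deadline: float) -> Optional[OperatingPoint]:
    """Run-time choice: the most accurate point that meets the deadline."""
    feasible = [p for p in front if p.exec_time <= deadline]
    return min(feasible, key=lambda p: p.error) if feasible else None
```

A run-time manager of the kind mentioned in the abstract could store only the front and call a selection function like `pick_for_deadline` whenever the QoS constraint changes.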
Archive | 2001
José Sanches; Marco Cornero; Miguel Santana; Philippe Guillaume; Jean-Marc Daveau; Thierry Lepley; Pierre G. Paulin; Michel Harrand