Germain Haugou
STMicroelectronics
Publications
Featured research published by Germain Haugou.
Design Automation Conference | 2012
Diego Melpignano; Luca Benini; Eric Flamand; Bruno Jego; Thierry Lepley; Germain Haugou; Fabien Clermidy; Denis Dutoit
P2012 is an area- and power-efficient many-core computing accelerator based on multiple globally asynchronous, locally synchronous processor clusters. Each cluster features up to 16 processors with independent instruction streams sharing a multi-banked, one-cycle-access L1 data memory, a multi-channel DMA engine and specialized hardware for synchronization and aggressive power management. P2012 is 3D-stacking ready and can be customized to achieve extreme area and energy efficiency by adding domain-specific HW IPs to the cluster. The first P2012 SoC prototype in 28 nm CMOS will sample in Q3, featuring four 16-processor clusters and a 1 MB L2 memory, and delivering 80 GOPS (with 32-bit single-precision floating-point support) in 18 mm² with 2 W worst-case power consumption. P2012 can run standard OpenCL™ and proprietary Native Programming Model SW components to achieve the highest level of control over application-to-resource mapping. A dedicated version of the OpenCV vision library is provided in the P2012 SW Development Kit to enable visual analytics acceleration. This paper discusses preliminary performance measurements of common feature extraction and tracking algorithms, parallelized on P2012, versus sequential execution on ARM CPUs.
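To make the OpenCL path above concrete, here is a minimal host-side sketch of offloading a toy image kernel to an accelerator device through the standard OpenCL API; the device selection, kernel body and buffer sizes are illustrative assumptions, not P2012 specifics.

```c
/* Minimal OpenCL host sketch: offload a kernel to a many-core accelerator.
 * Kernel body, device choice and sizes are illustrative, not P2012-specific. */
#include <CL/cl.h>
#include <stdlib.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    cl_int err;

    clGetPlatformIDs(1, &platform, NULL);
    /* On an SoC, the accelerator is typically exposed as CL_DEVICE_TYPE_ACCELERATOR. */
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);
    if (err != CL_SUCCESS) return 1;

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    const char *src =
        "__kernel void grad(__global const uchar *in, __global uchar *out) {"
        "  size_t i = get_global_id(0);"
        "  out[i] = (uchar)abs((int)in[i + 1] - (int)in[i]);" /* toy gradient */
        "}";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "grad", &err);

    size_t n = 640 * 480;                     /* one VGA frame, illustrative */
    unsigned char *img = calloc(n + 1, 1);
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n + 1, img, &err);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n, NULL, &err);

    clSetKernelArg(k, 0, sizeof(cl_mem), &in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &out);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, out, CL_TRUE, 0, n, img, 0, NULL, NULL);

    clReleaseMemObject(in); clReleaseMemObject(out);
    clReleaseKernel(k); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    free(img);
    return 0;
}
```

The run-time, not the host code, is responsible for spreading the work-items of the NDRange across the clusters' processors.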
IEEE Transactions on Circuits and Systems | 2017
Francesco Conti; Robert Schilling; Pasquale Davide Schiavone; Antonio Pullini; Davide Rossi; Frank K. Gürkaynak; Michael Muehlberghuber; Michael Gautschi; Igor Loi; Germain Haugou; Stefan Mangard; Luca Benini
Near-sensor data analytics is a promising direction for internet-of-things endpoints, as it minimizes energy spent on communication and reduces network load, but it also poses security concerns, as valuable data are stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a system-on-chip (SoC) based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65 nm technology, consumes less than 20 mW on average at 0.8 V, achieving an efficiency of up to 70 pJ/B in encryption, 50 pJ/px in convolution, and up to 25 MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep convolutional neural network (CNN) consuming 3.16 pJ per equivalent RISC operation, local CNN-based face detection with secured remote recognition at 5.74 pJ/op, and seizure detection with encrypted collection of electroencephalogram (EEG) data at 12.7 pJ/op.
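As an illustrative back-of-the-envelope check on these figures (assuming, purely for the arithmetic, that the average power is fully spent on CNN compute), the quoted per-operation energy implies a sustained throughput of

\[
\frac{P_{\mathrm{avg}}}{E_{\mathrm{op}}} = \frac{20\ \mathrm{mW}}{3.16\ \mathrm{pJ/op}} \approx 6.3 \times 10^{9}\ \mathrm{op/s},
\]

i.e., a few GOPS of equivalent RISC operations within the stated power envelope.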
Computing Frontiers | 2014
Davide Rossi; Igor Loi; Germain Haugou; Luca Benini
The evolution of multi- and many-core platforms is rapidly increasing the on-chip computational capabilities of embedded computing devices, while memory access remains dominated by on-chip and off-chip interconnect delays, which do not scale as well. For this reason, the bottleneck of many applications is rapidly moving from computation to communication. More precisely, performance is often bound by the huge latency of direct memory accesses. In this scenario, the challenge is to provide embedded multi- and many-core systems with a powerful, low-latency, energy-efficient and flexible way to move data through the levels of the memory hierarchy. In this paper, a DMA engine optimized for clustered, tightly coupled many-core systems is presented. The IP features a simple micro-coded programming interface and lock-free per-core command queues, improving flexibility while reducing programming latency. Moreover, by exploiting the cluster's shared memory as a local repository for data buffers, it dramatically reduces area and improves energy efficiency with respect to conventional DMAs. The proposed DMA engine improves access and programming latency by one order of magnitude and reduces IP area by 4x and power by 5x with respect to a conventional DMA, while providing full bandwidth to 16 independent logical channels.
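The following C sketch conveys the flavor of such a per-core, lock-free command interface; the descriptor fields, queue layout and function names are hypothetical, not the actual IP's programming model.

```c
/* Hypothetical per-core DMA command queue: each core owns a single-producer/
 * single-consumer queue of transfer descriptors mapped in the cluster's
 * shared L1 memory. All names and fields are illustrative. */
#include <stdint.h>

typedef struct {
    uint32_t ext_addr;   /* external (L2/DRAM) address          */
    uint32_t loc_addr;   /* cluster shared-L1 address           */
    uint16_t length;     /* transfer size in bytes              */
    uint8_t  dir;        /* 0 = ext->loc (load), 1 = loc->ext   */
    uint8_t  id;         /* transfer id, used to track status   */
} dma_cmd_t;

#define DMA_QUEUE_DEPTH 8
typedef struct {
    volatile dma_cmd_t slot[DMA_QUEUE_DEPTH];
    volatile uint32_t  head;   /* written only by the owning core */
    volatile uint32_t  tail;   /* written only by the DMA engine  */
} dma_queue_t;

/* Single producer (the core), single consumer (the engine): lock-free. */
static int dma_push(dma_queue_t *q, const dma_cmd_t *cmd) {
    uint32_t next = (q->head + 1) % DMA_QUEUE_DEPTH;
    if (next == q->tail) return -1;   /* queue full, caller may retry      */
    q->slot[q->head] = *cmd;
    q->head = next;                   /* publish: engine sees the new cmd  */
    return 0;
}

static void dma_wait(const dma_queue_t *q) {
    while (q->tail != q->head)        /* spin until the engine drains all  */
        ;                             /* a real core would sleep on an event line */
}
```

Because each queue has exactly one producer and one consumer, no locks or atomic read-modify-write operations are needed, which is what keeps the programming latency low.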
Computing Frontiers | 2015
Igor Loi; Davide Rossi; Germain Haugou; Michael Gautschi; Luca Benini
L1 instruction caches in many-core systems account for a sizable fraction of the total power consumption. Although large instruction caches can significantly improve performance, they can also increase power consumption. Private caches usually achieve higher speed thanks to their simpler design, but the smaller L1 memory space seen by each core induces a high miss ratio. A shared instruction cache is therefore an attractive solution to improve performance and energy efficiency while reducing area. In this paper we propose a multi-banked, shared instruction cache architecture suitable for ultra-low-power multicore systems, where parallelism and near-threshold operation are used to achieve minimum energy. We implemented the cluster architecture with different cache-sharing configurations, using 28 nm UTBB FD-SOI from STMicroelectronics as the reference technology. Experimental results, based on several real-life applications, demonstrate that the sharing mechanisms have no impact on the system operating frequency and reduce the energy consumption of the cache subsystem by up to 10% with the same area footprint, or reduce the overall shared cache area by 2× while keeping the same performance and energy efficiency with respect to a cluster of processing elements with private program caches.
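An assumed-numbers illustration of why sharing lowers the miss ratio: with a fixed 16 KB instruction-memory budget split across a 16-core cluster,

\[
C_{\mathrm{private/core}} = \frac{16\ \mathrm{KB}}{16} = 1\ \mathrm{KB}, \qquad C_{\mathrm{shared/core}} = 16\ \mathrm{KB},
\]

so any hot loop larger than 1 KB but smaller than 16 KB thrashes the private caches yet fits entirely in the shared one (the figures are illustrative, not taken from the paper).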
Lecture Notes in Computer Science | 2007
Sarah Hoffmann; Germain Haugou; Sophie Gabriele; Lilian Burdy
Microkernels have been developed to minimize the amount of software that needs to run in privileged CPU mode. They provide only a set of general hardware abstractions, on top of which an operating system with a high level of reliability and security can then be implemented. L4 is a second-generation microkernel based on the principles of minimalism, flexibility and efficiency. Its small size (about 15,000 lines of C++) and its relevance for security make it a good candidate for formal analysis. This paper describes our approach to developing a formal model of the API of the L4 microkernel. The goal was to evaluate the feasibility of modeling such software with formal techniques; more specifically, the objectives were:
- to describe precisely the mechanisms of L4,
- to obtain more extensive documentation of the microkernel,
- to highlight the points where the specification was incomplete,
- to prove some static properties of the kernel.
The formalism used to model the system is Event B. Event B describes a system through a set of data with properties and the events that modify those data: an event describes either an internal behavior of the system or how the system reacts to external actions. Finally, B provides automatic and interactive tools to help prove formally that each event always respects the properties.
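Concretely, the central proof obligation behind "each event always respects the properties" is the standard Event B invariant-preservation rule: for an event with guard $G$ and before-after predicate $\mathit{BA}$ over state $v$, the tools require

\[
I(v) \wedge G(v) \wedge \mathit{BA}(v, v') \;\Rightarrow\; I(v')
\]

for every event and every invariant $I$, discharged automatically where possible and interactively otherwise.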
Power and Timing Modeling, Optimization and Simulation | 2013
Thomas Ducroux; Germain Haugou; Vincent Risson; Pascal Vivet
Power consumption is crucial in embedded systems, mainly because of limited battery capacity and the problem of heat dissipation. The energy efficiency of Systems-on-Chip (SoCs) is optimized at both the hardware and the software level using simulation platforms. The challenge for these platforms lies in the trade-off between accuracy and simulation speed for early architecture exploration and HW/SW validation. In the context of many-core systems executing heavy software stacks, fast simulation platforms are required. In this paper, we present our approach to modeling the power of a complex many-core system in order to estimate the power consumption of the software applications it executes. In particular, we propose a light and accurate power model for VLIW processors, as this kind of processor is commonly used in such systems. The power estimator we set up is part of a practical, fully automated power characterization framework that includes low-level simulations, which are then used to back-annotate fast simulation models. Our case study is the STHORM accelerator, a clustered many-core architecture comprising dual-issue processors, a complex memory hierarchy and DMA engines. Experimental results show that we can perform fast power estimation with an error below 4% for the VLIW cores, which are the main source of power consumption, below 10% for the overall SoC platform, and with a simulation-time overhead below 1%.
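A hedged sketch of the back-annotation idea: per-event energy costs, characterized once with slow low-level simulation, weight the activity counters produced by the fast simulator. The event classes and energy values below are invented for illustration.

```c
/* Illustrative back-annotation: per-event energy costs (characterized at
 * low level) weight the activity counters from a fast instruction-level
 * simulator. Event classes and cost values are made up for illustration. */
#include <stdio.h>

enum { EV_ALU, EV_MUL, EV_LOAD, EV_STORE, EV_IDLE, EV_COUNT };

static const double energy_pj[EV_COUNT] = {
    [EV_ALU]   =  6.0,   /* pJ per issued ALU bundle            */
    [EV_MUL]   = 11.0,   /* pJ per multiply                     */
    [EV_LOAD]  = 15.0,   /* pJ per L1 load                      */
    [EV_STORE] = 14.0,   /* pJ per L1 store                     */
    [EV_IDLE]  =  1.5,   /* pJ per idle cycle (leakage + clock) */
};

/* counters[] comes from the fast simulator; cycles and freq give avg power. */
double estimate_power_mw(const unsigned long counters[EV_COUNT],
                         unsigned long cycles, double freq_mhz) {
    double energy_pj_total = 0.0;
    for (int e = 0; e < EV_COUNT; ++e)
        energy_pj_total += energy_pj[e] * (double)counters[e];
    double time_us  = (double)cycles / freq_mhz; /* MHz == cycles per us */
    double power_uw = energy_pj_total / time_us; /* pJ / us == uW        */
    return power_uw / 1000.0;                    /* -> mW                */
}

int main(void) {
    unsigned long c[EV_COUNT] = { 800000, 50000, 120000, 60000, 100000 };
    printf("estimated power: %.2f mW\n", estimate_power_mw(c, 1130000, 400.0));
    return 0;
}
```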
IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip | 2015
Giuseppe Tagliavini; Germain Haugou; Andrea Marongiu; Luca Benini
The acceleration of computer vision algorithms is an important enabler for the increasingly pervasive applications of the embedded vision domain. Heterogeneous systems featuring a clustered many-core accelerator are a very promising target for embedded vision workloads, but code optimization for these platforms is a challenging task. In this work we introduce ADRENALINE, a novel framework for fast prototyping and optimization of OpenVX applications for heterogeneous SoCs with many-core accelerators. ADRENALINE consists of an optimized OpenVX run-time system and a virtual platform, and is intended to support a wide range of end users. We highlight the benefits of this approach in different optimization contexts.
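For readers unfamiliar with OpenVX, the sketch below shows the programming style such a run-time executes, using standard OpenVX 1.x calls; the particular pipeline is illustrative.

```c
/* Standard OpenVX 1.x host code: build, verify and run a small graph.
 * The pipeline (Gaussian blur -> Sobel -> magnitude) is illustrative. */
#include <VX/vx.h>

int main(void) {
    vx_context ctx = vxCreateContext();
    vx_graph graph = vxCreateGraph(ctx);

    vx_image in   = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
    vx_image blur = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
    vx_image gx   = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_S16);
    vx_image gy   = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_S16);
    vx_image mag  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_S16);

    vxGaussian3x3Node(graph, in, blur);
    vxSobel3x3Node(graph, blur, gx, gy);
    vxMagnitudeNode(graph, gx, gy, mag);

    if (vxVerifyGraph(graph) == VX_SUCCESS)  /* run-time checks + optimizes */
        vxProcessGraph(graph);               /* execute once per frame      */

    vxReleaseImage(&in);  vxReleaseImage(&blur);
    vxReleaseImage(&gx);  vxReleaseImage(&gy);
    vxReleaseImage(&mag);
    vxReleaseGraph(&graph);
    vxReleaseContext(&ctx);
    return 0;
}
```

The virtual images are precisely where an optimizing run-time has freedom: since they are invisible to the application, their storage can be kept in on-chip memory rather than in off-chip frame buffers.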
Conference on Design and Architectures for Signal and Image Processing | 2014
Giuseppe Tagliavini; Germain Haugou; Luca Benini
Computer vision and computational photography are hot application areas for mobile and embedded computing platforms. As a consequence, many-core accelerators are being developed to efficiently execute highly parallel image processing kernels. However, power and cost constraints impose hard limits on the available main memory bandwidth, and push for software optimizations that minimize the use of large frame buffers for the intermediate results of multi-kernel applications. In this work we propose a set of techniques, mainly based on graph analysis and image tiling, to accelerate the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant with the OpenVX standard, based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator prototype demonstrate that our approach massively reduces main-memory-related stall time, even when the main memory bandwidth available to the accelerator is severely constrained.
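The tiling side of the approach can be sketched as follows; this is plain C illustrating the principle, not the framework's actual API, and the two kernels, tile size and halo are assumptions.

```c
/* Illustrative tiled execution of a two-kernel pipeline (blur, threshold):
 * the blur->threshold intermediate stays in small on-chip buffers, and only
 * input and output tiles cross the off-chip boundary. Plain C stand-ins
 * replace the cluster's DMA and the real framework's API. */
#define TILE 32   /* square tiles, sized so the buffers fit on-chip      */
#define HALO 1    /* a 3x3 kernel reads one extra pixel past each edge   */

/* DMA stand-in: copy a (w x h) block at (x, y), clamping at frame borders. */
static void copy_block(unsigned char *dst, const unsigned char *frame,
                       int fw, int fh, int x, int y, int w, int h) {
    for (int r = 0; r < h; ++r)
        for (int c = 0; c < w; ++c) {
            int sy = y + r, sx = x + c;
            if (sy < 0) sy = 0;
            if (sy >= fh) sy = fh - 1;
            if (sx < 0) sx = 0;
            if (sx >= fw) sx = fw - 1;
            dst[r * w + c] = frame[sy * fw + sx];
        }
}

void run_tiled(const unsigned char *in, unsigned char *out, int fw, int fh) {
    unsigned char t_in[(TILE + 2 * HALO) * (TILE + 2 * HALO)];
    unsigned char t_mid[TILE * TILE];          /* never leaves "on-chip"  */
    int iw = TILE + 2 * HALO;

    for (int y = 0; y < fh; y += TILE)
        for (int x = 0; x < fw; x += TILE) {
            copy_block(t_in, in, fw, fh, x - HALO, y - HALO, iw, iw);
            for (int r = 0; r < TILE; ++r)     /* kernel 1: 3x3 box blur  */
                for (int c = 0; c < TILE; ++c) {
                    int s = 0;
                    for (int dy = 0; dy <= 2; ++dy)
                        for (int dx = 0; dx <= 2; ++dx)
                            s += t_in[(r + dy) * iw + (c + dx)];
                    t_mid[r * TILE + c] = (unsigned char)(s / 9);
                }
            for (int r = 0; r < TILE && y + r < fh; ++r)  /* kernel 2 + write-back */
                for (int c = 0; c < TILE && x + c < fw; ++c)
                    out[(y + r) * fw + (x + c)] =
                        t_mid[r * TILE + c] > 128 ? 255 : 0;
        }
}
```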
Journal of Real-Time Image Processing | 2018
Giuseppe Tagliavini; Germain Haugou; Andrea Marongiu; Luca Benini
In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution to efficiently execute highly parallel kernels. However, architectural constraints impose hard limits on main memory bandwidth, and push for software techniques which optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, to accelerate the execution of image processing applications expressed as standard OpenVX graphs on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant with the OpenVX standard, based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to massive reductions in execution time and bandwidth usage, even when the main memory bandwidth available to the accelerator is severely constrained.
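The graph-analysis side can be sketched similarly (the types and the compatibility test are invented for illustration): a pass over the application DAG decides which intermediate images can live in small on-chip tile buffers and which must fall back to off-chip frame buffers.

```c
/* Illustrative edge classification over an OpenVX-like DAG: intermediates
 * whose producer and consumer both operate tile-by-tile can live in small
 * on-chip tile buffers; anything else falls back to an off-chip frame
 * buffer. All types and criteria are invented for illustration. */
#include <stdbool.h>
#include <stddef.h>

typedef enum { BUF_OFFCHIP_FRAME, BUF_ONCHIP_TILE } buf_class_t;

typedef struct {
    bool tileable;           /* kernel can run on a tile plus a fixed halo */
    int  halo;               /* extra border pixels the kernel reads       */
} node_t;

typedef struct {
    const node_t *producer;  /* NULL for graph inputs                      */
    const node_t *consumer;  /* NULL for graph outputs                     */
    int consumers;           /* fan-out of this intermediate image         */
    buf_class_t cls;         /* filled in by classify_edges()              */
} edge_t;

void classify_edges(edge_t *edges, int n, int max_halo) {
    for (int i = 0; i < n; ++i) {
        edge_t *e = &edges[i];
        bool internal   = e->producer != NULL && e->consumer != NULL;
        bool tileable   = internal && e->producer->tileable
                                   && e->consumer->tileable;
        bool small_halo = tileable && e->consumer->halo <= max_halo;
        /* Fan-out > 1 would force a tile to stay alive for several
         * consumers; keep it simple and spill those off-chip. */
        e->cls = (small_halo && e->consumers == 1) ? BUF_ONCHIP_TILE
                                                   : BUF_OFFCHIP_FRAME;
    }
}
```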
Software and Compilers for Embedded Systems | 2015
Giuseppe Tagliavini; Germain Haugou; Andrea Marongiu; Luca Benini
Nowadays, computer vision applications are ubiquitous, and their presence on embedded devices is increasingly widespread. Heterogeneous embedded systems featuring a clustered many-core accelerator are a very promising target for embedded vision algorithms, but code optimization for these platforms is a challenging task. Moreover, designers need support tools that are both fast and accurate. In this work we introduce ADRENALINE, an environment for the development and optimization of OpenVX applications targeting many-core accelerators. ADRENALINE consists of a custom OpenVX run-time and a virtual platform, and is intended to help enhance the performance of embedded vision applications.