Publication


Featured research published by Francesco Conti.


Design, Automation, and Test in Europe | 2015

A ultra-low-energy convolution engine for fast brain-inspired vision in multicore clusters

Francesco Conti; Luca Benini

State-of-the-art brain-inspired computer vision algorithms such as Convolutional Neural Networks (CNNs) are reaching accuracy and performance rivaling that of humans; however, the gap in terms of energy consumption is still many orders of magnitude wide. Many-core architectures using shared-memory clusters of power-optimized RISC processors have been proposed as a possible solution to help close this gap. In this work, we propose to augment these clusters with Hardware Convolution Engines (HWCEs): ultra-low energy coprocessors for accelerating convolutions, the main building block of many brain-inspired computer vision algorithms. Our synthesis results in ST 28nm FD-SOI technology show that the HWCE is capable of performing a convolution in the lowest-energy state spending as little as 35 pJ/pixel on average, with an optimum case of 6.5 pJ/pixel. Furthermore, we show that augmenting a cluster with a HWCE can lead to an average boost of 40x or more in energy efficiency in convolutional workloads.


IEEE Transactions on Circuits and Systems | 2017

An IoT Endpoint System-on-Chip for Secure and Energy-Efficient Near-Sensor Analytics

Francesco Conti; Robert Schilling; Pasquale Davide Schiavone; Antonio Pullini; Davide Rossi; Frank K. Gürkaynak; Michael Muehlberghuber; Michael Gautschi; Igor Loi; Germain Haugou; Stefan Mangard; Luca Benini

Near-sensor data analytics is a promising direction for internet-of-things endpoints, as it minimizes energy spent on communication and reduces network load - but it also poses security concerns, as valuable data are stored or sent over the network at various stages of the analytics pipeline. Using encryption to protect sensitive data at the boundary of the on-chip analytics engine is a way to address data security issues. To cope with the combined workload of analytics and encryption in a tight power envelope, we propose Fulmine, a system-on-chip (SoC) based on a tightly-coupled multi-core cluster augmented with specialized blocks for compute-intensive data processing and encryption functions, supporting software programmability for regular computing tasks. The Fulmine SoC, fabricated in 65-nm technology, consumes less than 20mW on average at 0.8V achieving an efficiency of up to 70pJ/B in encryption, 50pJ/px in convolution, or up to 25MIPS/mW in software. As a strong argument for real-life flexible application of our platform, we show experimental results for three secure analytics use cases: secure autonomous aerial surveillance with a state-of-the-art deep convolutional neural network (CNN) consuming 3.16pJ per equivalent reduced instruction set computer operation, local CNN-based face detection with secured remote recognition in 5.74pJ/op, and seizure detection with encrypted data collection from electroencephalogram within 12.7pJ/op.


Signal Processing Systems | 2016

PULP: A Ultra-Low Power Parallel Accelerator for Energy-Efficient and Flexible Embedded Vision

Francesco Conti; Davide Rossi; Antonio Pullini; Igor Loi; Luca Benini

Novel pervasive devices such as smart surveillance cameras and autonomous micro-UAVs could greatly benefit from the availability of a computing device supporting embedded computer vision at a very low power budget. To this end, we propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology. We show that PULP performance can be scaled over a 1x-354x range, with a peak theoretical energy efficiency of 211 GOPS/W. We present performance results for several demanding kernels from the image processing and vision domain, with post-layout power modeling: a motion detection application that can run at an efficiency up to 192 GOPS/W (90 % of the theoretical peak); a ConvNet-based detector for smart surveillance that can be switched between 0.7 and 27fps operating modes, scaling energy consumption per frame between 1.2 and 12mJ on a 320 ×240 image; and FAST + Lucas-Kanade optical flow on a 128 ×128 image at the ultra-low energy budget of 14 μJ per frame at 60fps.


Computer Vision and Pattern Recognition | 2014

Brain-Inspired Classroom Occupancy Monitoring on a Low-Power Mobile Platform

Francesco Conti; Antonio Pullini; Luca Benini

Brain-inspired computer vision (BICV) has evolved rapidly in recent years and it is now competitive with traditional CV approaches. However, most BICV algorithms have been developed on high power-and-performance platforms (e.g. workstations) or special purpose hardware. We propose two different algorithms for counting people in a classroom, both based on Convolutional Neural Networks (CNNs), a state-of-the-art deep learning model inspired by the structure of the human visual cortex. Furthermore, we provide a standalone parallel C library that implements CNNs and use it to deploy our algorithms on the embedded mobile ARM big.LITTLE-based Odroid-XU platform. Our performance and power measurements show that neuromorphic vision is feasible on off-the-shelf embedded mobile platforms, and we show that it can reach very good energy efficiency for non-time-critical tasks such as people counting.


Signal Processing Systems | 2014

Energy-efficient vision on the PULP platform for ultra-low power parallel computing

Francesco Conti; Davide Rossi; Antonio Pullini; Igor Loi; Luca Benini

Many-core architectures structured as fabrics of tightly-coupled clusters have shown promising results on embedded computer vision benchmarks, providing state-of-the-art performance with a reduced power budget. We propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology. As a use case for PULP, we show that a computationally demanding vision kernel based on Convolutional Neural Networks can be quickly and efficiently switched from a low power, low frame-rate operating point to a high frame-rate one when a detection is performed. Our results show that PULP performance can be scaled over a 1x-354x range, with a peak performance/power efficiency of 211 GOPS/W.


Great Lakes Symposium on VLSI | 2014

He-P2012: architectural heterogeneity exploration on a scalable many-core platform

Francesco Conti; Chuck Pilkington; Andrea Marongiu; Luca Benini

Architectural heterogeneity is a promising solution to overcome the utilization wall and provide Moore's Law-like performance scaling in future SoCs. However, heterogeneous architectures increase the size and complexity of the design space along several axes: granularity of the heterogeneous processors, coupling with the software cores, communication interfaces, etc. As a consequence, significant enhancements are required to tools and methodologies to explore the huge design space effectively. In this work, we provide three main contributions: first, we describe an extension to the STMicroelectronics P2012 platform to support tightly-coupled shared memory HW processing elements (HWPE), along with our changes to the P2012 simulation flow to integrate this extension. Second, we propose a novel methodology for the semi-automatic definition and instantiation of HWPEs from a C program based on an interface description language. Third, we explore several architectural variants on a set of benchmarks originally developed for the homogeneous version of P2012, achieving up to 123x speedup for the accelerated code region (~98% of the Amdahl limit for the whole application), thereby demonstrating the efficiency of tightly memory-coupled hardware acceleration.


IEEE Convention of Electrical and Electronics Engineers in Israel | 2014

Energy efficient parallel computing on the PULP platform with support for OpenMP

Davide Rossi; Igor Loi; Francesco Conti; Giuseppe Tagliavini; Antonio Pullini; Andrea Marongiu

Many-core architectures structured as fabrics of tightly-coupled clusters have shown promising results on embedded parallel applications, providing state-of-the-art performance with a reduced power budget. We propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for fast performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology. To exploit the thread-level parallelism of applications, PULP supports a lightweight implementation of OpenMP 3.0 running on a bare-metal runtime optimized for embedded architectures. The proposed platform proves able to provide high performance for a wide range of workloads ranging from 1.2 MOPS to 3 GOPS with a peak energy efficiency of 210 GOPS/W. Thanks to the efficient exploitation of forward and reverse body biasing on fine grained regions of the cluster, the platform is able to improve by up to 1.3x the energy efficiency of parallel portions, and by up to 2.4x the energy efficiency of sequential portions of OpenMP applications.


Application-Specific Systems, Architectures, and Processors | 2014

He-P2012: Architectural heterogeneity exploration on a scalable many-core platform

Francesco Conti; Chuck Pilkington; Andrea Marongiu; Luca Benini

Architectural heterogeneity is a promising solution to overcome the utilization wall and provide Moore's Law-like performance scaling in future SoCs. However, heterogeneous architectures increase the size and complexity of the design space along several axes: granularity of the heterogeneous processors, coupling with the software cores, communication interfaces, etc. As a consequence, significant enhancements are required to tools and methodologies to explore the huge design space effectively. In this work, we provide three main contributions: first, we describe an extension to the STMicroelectronics P2012 platform to support tightly-coupled shared memory HW processing elements (HWPE), along with our changes to the P2012 simulation flow to integrate this extension. Second, we propose a novel methodology for the semi-automatic definition and instantiation of HWPEs from a C program based on an interface description language. Third, we explore several architectural variants on a set of benchmarks originally developed for the homogeneous version of P2012, achieving up to 123x speedup for the accelerated code region (~98% of the Amdahl limit for the whole application), thereby demonstrating the efficiency of tightly memory-coupled hardware acceleration.


Design, Automation, and Test in Europe | 2016

Enabling the heterogeneous accelerator model on ultra-low power microcontroller platforms

Francesco Conti; Daniele Palossi; Andrea Marongiu; Davide Rossi; Luca Benini

The stringent power constraints of complex microcontroller based devices (e.g. smart sensors for the IoT) represent an obstacle to the introduction of sophisticated functionality. Programmable accelerators would be extremely beneficial to provide the flexibility and energy efficiency required by fast-evolving IoT applications; however, the integration complexity and sub-10mW power budgets have been considered insurmountable obstacles so far. In this paper we demonstrate the feasibility of coupling a low power microcontroller unit (MCU) with a heterogeneous programmable accelerator for speeding up computation-intensive algorithms at an ultra-low power (ULP) sub-10mW budget. Specifically, we develop a heterogeneous architecture coupling a Cortex-M series MCU with PULP, a programmable accelerator for ULP parallel computing. Complex functionality is enabled by the support for offloading parallel computational kernels from the MCU to the accelerator using the OpenMP programming model. We prototype this platform using an STM Nucleo board and a PULP FPGA emulator. We show that our methodology can deliver up to 60× gains in performance and energy efficiency on a diverse set of applications, opening the way for a new class of ULP heterogeneous architectures.


International Conference on Hardware/Software Codesign and System Synthesis | 2013

Synthesis-friendly techniques for tightly-coupled integration of hardware accelerators into shared-memory multi-core clusters

Francesco Conti; Andrea Marongiu; Luca Benini

Several many-core designs tackle scalability issues by leveraging tightly-coupled clusters as building blocks, where low-latency, high-bandwidth interconnection between a small/medium number of cores and L1 memory achieves high performance/watt. Tight coupling of hardware accelerators into these multicore clusters constitutes a promising approach to further improve performance/area/watt. However, accelerators are often clocked at a lower frequency than processor clusters for energy efficiency reasons. In this paper, we propose a technique to integrate shared-memory accelerators within the tightly-coupled clusters of the STMicroelectronics STHORM architecture. Our methodology significantly relaxes timing constraints for tightly-coupled accelerators, while optimizing data bandwidth. In addition, our technique allows the accelerator to operate at an integer submultiple of the cluster frequency. Experimental results show that the proposed approach recovers up to 84% of the slowdown implied by the reduced accelerator speed.

Collaboration


Dive into Francesco Conti's collaboration.

Top Co-Authors

Igor Loi

University of Bologna

Luigi Raffo

University of Cagliari