Publication


Featured research published by Giuseppe Tagliavini.


Design, Automation and Test in Europe (DATE) | 2013

Enabling fine-grained OpenMP tasking on tightly-coupled shared memory clusters

Paolo Burgio; Giuseppe Tagliavini; Andrea Marongiu; Luca Benini

Cluster-based architectures are increasingly being adopted to design embedded many-cores. These platforms can deliver very high peak performance within a contained power envelope, provided that programmers can make effective use of the available parallel cores. This is becoming an extremely difficult task, as embedded applications are growing in complexity and exhibit irregular and dynamic parallelism. The OpenMP tasking extensions represent a powerful abstraction to capture this form of parallelism. However, supporting it efficiently on cluster-based embedded SoCs is not easy, because the fine-grained parallel workloads found in embedded applications cannot tolerate high memory and runtime overheads. In this paper we present our design of the runtime layer supporting OpenMP tasking on an embedded shared memory cluster, identifying the key aspects for achieving high performance and discussing the architectural support needed to remove major bottlenecks.
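
For readers unfamiliar with the construct, the sketch below shows the kind of fine-grained OpenMP tasking the paper targets: a generic, standard OpenMP 3.0 example (not the paper's runtime code), where one thread spawns a small task per work item and the other threads of the cluster execute them.

    /* Minimal OpenMP tasking sketch (generic standard OpenMP, not the
     * paper's runtime): one thread spawns a fine-grained task per work
     * item; the remaining threads pick the tasks up dynamically. */
    #include <omp.h>
    #include <stdio.h>

    #define N 64

    static void process(int i) {
        /* placeholder for a small, irregular unit of work */
        printf("task %d handled by thread %d\n", i, omp_get_thread_num());
    }

    int main(void) {
        #pragma omp parallel
        {
            #pragma omp single
            {
                for (int i = 0; i < N; i++) {
                    #pragma omp task firstprivate(i)
                    process(i);
                }
                #pragma omp taskwait   /* wait for all spawned tasks */
            }
        }
        return 0;
    }

With tasks this small, the per-task scheduling cost of the runtime dominates, which is exactly the overhead the paper's design aims to minimize.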


IEEE Transactions on Industrial Informatics | 2015

Simplifying Many-Core-Based Heterogeneous SoC Programming With Offload Directives

Andrea Marongiu; Alessandro Capotondi; Giuseppe Tagliavini; Luca Benini

Multiprocessor systems-on-chip (MPSoCs) are evolving into heterogeneous architectures based on one host processor plus many-core accelerators. While heterogeneous SoCs promise a higher performance/watt ratio, programming them typically requires major code rewrites using low-level abstractions (e.g., OpenCL). We present a programming model based on OpenMP, with additional directives to program the accelerator from a single host program. As a test case, we evaluate an implementation of this programming model for the STMicroelectronics STHORM development board. We obtain near-ideal throughput for most benchmarks, performance very close to hand-optimized OpenCL code at a significantly lower programming complexity, and up to a 30× speedup over host execution.
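
As an illustration of the single-source offload style, the sketch below uses the standard OpenMP 4.x target directives; the paper's OpenMP-based directives for STHORM are extensions of this idea and may differ in exact syntax.

    /* Offload-style sketch using standard OpenMP 4.x target directives.
     * This only illustrates the single-source host+accelerator model;
     * it is not the directive set described in the paper. */
    #include <stdio.h>

    #define N 1024

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        /* Offload the loop to the accelerator; a and b are copied in,
         * c is copied back to the host after the region completes. */
        #pragma omp target map(to: a, b) map(from: c)
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[10] = %f\n", c[10]);
        return 0;
    }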


High-Performance Computer Architecture (HPCA) | 2015

Exploring architectural heterogeneity in intelligent vision systems

Nandhini Chandramoorthy; Giuseppe Tagliavini; Kevin M. Irick; Antonio Pullini; Siddharth Advani; Sulaiman Al Habsi; Matthew Cotter; Jack Sampson; Vijaykrishnan Narayanan; Luca Benini

Limited power budgets and the need for high-performance computing have led to platform customization, with a number of accelerators integrated alongside CMPs. In order to study customized architectures, we model four customization design points and compare their performance and energy across a number of computer vision workloads. We analyze the limitations of generic architectures and quantify the costs of increasing customization using these micro-architectural design points. This analysis leads us to develop a framework consisting of low-power multi-cores and an array of configurable micro-accelerator functional units. Using this platform, we illustrate dataflow and control processing optimizations that provide performance gains similar to those of custom ASICs for a wide range of vision benchmarks.


IEEE Convention of Electrical and Electronics Engineers in Israel | 2014

Energy efficient parallel computing on the PULP platform with support for OpenMP

Davide Rossi; Igor Loi; Francesco Conti; Giuseppe Tagliavini; Antonio Pullini; Andrea Marongiu

Many-core architectures structured as fabrics of tightly-coupled clusters have shown promising results on embedded parallel applications, providing state-of-the-art performance within a reduced power budget. We propose PULP (Parallel processing Ultra-Low Power platform), an architecture built on clusters of tightly-coupled OpenRISC ISA cores, with advanced techniques for performance and energy scalability that exploit the capabilities of the STMicroelectronics UTBB FD-SOI 28nm technology. To exploit the thread-level parallelism of applications, PULP supports a lightweight implementation of OpenMP 3.0 running on a bare-metal runtime optimized for embedded architectures. The proposed platform is able to provide high performance for a wide range of workloads, from 1.2 MOPS to 3 GOPS, with a peak energy efficiency of 210 GOPS/W. Thanks to the efficient exploitation of forward and reverse body biasing on fine-grained regions of the cluster, the platform improves the energy efficiency of the parallel portions of OpenMP applications by up to 1.3× and of the sequential portions by up to 2.4×.
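
The sketch below is a generic OpenMP 3.0 worksharing loop of the kind the lightweight PULP runtime supports (an illustrative example, not PULP-specific code); parallel regions like this one are the portions whose energy efficiency the cluster-level body biasing described above improves.

    /* Generic OpenMP 3.0 worksharing example: a parallel reduction over
     * an array, split across the cores of a cluster by the runtime. */
    #include <omp.h>

    #define N 256

    int dot(const int *a, const int *b) {
        int sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];
        return sum;
    }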


Computing Frontiers | 2011

MPOpt-Cell: a high-performance data-flow programming environment for the CELL BE processor

Alessio Franceschelli; Paolo Burgio; Giuseppe Tagliavini; Andrea Marongiu; Martino Ruggiero; Michele Lombardi; Alessio Bonfietti; Michela Milano; Luca Benini

We present MPOpt-Cell, an architecture-aware framework for high-productivity development and efficient execution of stream applications on the CELL BE Processor. It enables developers to quickly build Synchronous Data Flow (SDF) applications using a simple and intuitive programming interface based on a set of compiler directives that capture the key abstractions of SDF. The compiler backend and system runtime efficiently manage hardware resources.
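
To make the underlying model concrete, the following is a conceptual synchronous data flow sketch in plain C (not MPOpt-Cell's directive syntax, which is not reproduced here): two actors with fixed token rates connected by a FIFO, fired according to a static schedule derived from those rates.

    /* Conceptual SDF sketch: actor A produces 2 tokens per firing,
     * actor B consumes 2 tokens per firing, so the balanced static
     * schedule is simply A B A B ...  (illustration only). */
    #include <stdio.h>

    #define TOKENS 8

    static int fifo[TOKENS];
    static int head = 0, tail = 0;

    static void actor_A(int firing) {
        for (int k = 0; k < 2; k++)
            fifo[tail++] = firing * 2 + k;   /* produce 2 tokens */
    }

    static void actor_B(void) {
        int x = fifo[head++];                /* consume 2 tokens */
        int y = fifo[head++];
        printf("B fired on (%d, %d)\n", x, y);
    }

    int main(void) {
        for (int firing = 0; firing < TOKENS / 2; firing++) {
            actor_A(firing);
            actor_B();
        }
        return 0;
    }

In a framework like MPOpt-Cell, the production/consumption rates are captured declaratively and the schedule and buffer sizes are derived by the compiler backend rather than written by hand as above.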


IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip | 2015

ADRENALINE: An OpenVX Environment to Optimize Embedded Vision Applications on Many-core Accelerators

Giuseppe Tagliavini; Germain Haugou; Andrea Marongiu; Luca Benini

The acceleration of computer vision algorithms is an important enabler for the increasingly pervasive applications of the embedded vision domain. Heterogeneous systems featuring a clustered many-core accelerator are a very promising target for embedded vision workloads, but code optimization for these platforms is a challenging task. In this work we introduce ADRENALINE, a novel framework for fast prototyping and optimization of OpenVX applications for heterogeneous SoCs with many-core accelerators. ADRENALINE consists of an optimized OpenVX run-time system and a virtual platform, and it is intended to support a wide range of end users. We highlight the benefits of this approach in different optimization contexts.
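
For context, the sketch below builds a minimal standard OpenVX graph of the kind ADRENALINE takes as input; all calls are from the OpenVX 1.x standard API, not ADRENALINE-specific extensions.

    /* Minimal standard OpenVX graph: Gaussian blur followed by a 3x3
     * median filter, with a virtual image as the intermediate result. */
    #include <VX/vx.h>

    int build_and_run(void) {
        vx_context ctx  = vxCreateContext();
        vx_graph  graph = vxCreateGraph(ctx);

        vx_image in   = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);
        vx_image blur = vxCreateVirtualImage(graph, 640, 480, VX_DF_IMAGE_U8);
        vx_image out  = vxCreateImage(ctx, 640, 480, VX_DF_IMAGE_U8);

        vxGaussian3x3Node(graph, in, blur);
        vxMedian3x3Node(graph, blur, out);

        /* Graph verification lets the run-time analyze the whole graph
         * and plan its execution before any frame is processed. */
        vx_status status = vxVerifyGraph(graph);
        if (status == VX_SUCCESS)
            status = vxProcessGraph(graph);

        vxReleaseGraph(&graph);
        vxReleaseContext(&ctx);
        return (status == VX_SUCCESS) ? 0 : -1;
    }

Because the whole graph is known at verification time, a run-time such as ADRENALINE's can decide buffer placement and execution order globally rather than kernel by kernel.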


Conference on Design and Architectures for Signal and Image Processing | 2014

Optimizing memory bandwidth in OpenVX graph execution on embedded many-core accelerators

Giuseppe Tagliavini; Germain Haugou; Luca Benini

Computer vision and computational photography are hot application areas for mobile and embedded computing platforms. As a consequence, many-core accelerators are being developed to efficiently execute highly parallel image processing kernels. However, power and cost constraints impose hard limits on the available main memory bandwidth, and push for software optimizations that minimize the use of large frame buffers to store the intermediate results of multi-kernel applications. In this work we propose a set of techniques, mainly based on graph analysis and image tiling, targeted at accelerating the execution of image processing applications, expressed as standard OpenVX graphs, on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant with the OpenVX standard, based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator prototype demonstrate that our approach leads to massive reductions of main-memory-related stall time, even when the main memory bandwidth available to the accelerator is severely constrained.
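
The core idea behind the tiling techniques can be illustrated with the simplified sketch below (an assumption-level illustration, not the actual run-time): a two-kernel pipeline is processed tile by tile, so the intermediate image only needs a small on-chip buffer instead of a full-frame buffer in main memory. Point-wise kernels are used for simplicity; real stencil kernels would also need halo rows.

    #define W      640
    #define H      480
    #define TILE_H 16                                /* rows per tile kept on-chip */

    static void kernel1(const unsigned char *in, unsigned char *out, int n) {
        for (int i = 0; i < n; i++) out[i] = (unsigned char)(in[i] / 2);
    }

    static void kernel2(const unsigned char *in, unsigned char *out, int n) {
        for (int i = 0; i < n; i++) out[i] = (unsigned char)(255 - in[i]);
    }

    void run_tiled(const unsigned char *frame_in, unsigned char *frame_out) {
        static unsigned char tile_tmp[TILE_H * W];   /* stands in for cluster L1 memory */

        for (int row = 0; row < H; row += TILE_H) {
            int rows = (row + TILE_H <= H) ? TILE_H : (H - row);
            kernel1(frame_in + row * W, tile_tmp, rows * W);   /* first kernel on tile  */
            kernel2(tile_tmp, frame_out + row * W, rows * W);  /* second kernel on tile */
        }
    }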


Design, Automation and Test in Europe (DATE) | 2014

Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters

Paolo Burgio; Giuseppe Tagliavini; Francesco Conti; Andrea Marongiu; Luca Benini

Modern embedded systems increasingly embrace cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment with minimal overhead for scheduling parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of all three constructs. The HWSE is designed as a block tightly coupled to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, which is fundamental to achieving fast dynamic scheduling and, ultimately, to enabling fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.
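
To give a feel for a shared-memory-mapped scheduling interface, here is a hypothetical sketch of how a PE might obtain loop chunks from a hardware scheduler. The register names, offsets and base address are invented for illustration and do not describe the actual HWSE programming interface.

    /* Hypothetical memory-mapped dynamic-loop interface (illustration only).
     * One PE is assumed to configure the loop descriptor; every PE then
     * fetches chunk start indices from the engine with a single load. */
    #include <stdint.h>

    #define HWSE_BASE       0x10200000u                              /* assumed address */
    #define HWSE_LOOP_START (*(volatile uint32_t *)(HWSE_BASE + 0x00))
    #define HWSE_LOOP_END   (*(volatile uint32_t *)(HWSE_BASE + 0x04))
    #define HWSE_CHUNK      (*(volatile uint32_t *)(HWSE_BASE + 0x08))
    #define HWSE_NEXT       (*(volatile uint32_t *)(HWSE_BASE + 0x0C))
    #define HWSE_DONE       0xFFFFFFFFu

    void dynamic_loop(void (*body)(uint32_t), uint32_t lb, uint32_t ub, uint32_t chunk) {
        HWSE_LOOP_START = lb;       /* configuration, done by one PE */
        HWSE_LOOP_END   = ub;
        HWSE_CHUNK      = chunk;

        for (;;) {
            uint32_t start = HWSE_NEXT;        /* one load per chunk request */
            if (start == HWSE_DONE)
                break;
            uint32_t end = (start + chunk < ub) ? start + chunk : ub;
            for (uint32_t i = start; i < end; i++)
                body(i);
        }
    }

Compared with a software scheduler protected by locks, a single uncontended load per chunk is what makes fine-grained dynamic scheduling affordable.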


Journal of Real-Time Image Processing | 2018

Optimizing memory bandwidth exploitation for OpenVX applications on embedded many-core accelerators

Giuseppe Tagliavini; Germain Haugou; Andrea Marongiu; Luca Benini

In recent years, image processing has been a key application area for mobile and embedded computing platforms. In this context, many-core accelerators are a viable solution to efficiently execute highly parallel kernels. However, architectural constraints impose hard limits on main memory bandwidth and push for software techniques which optimize the memory usage of complex multi-kernel applications. In this work, we propose a set of techniques, mainly based on graph analysis and image tiling, targeted at accelerating the execution of image processing applications, expressed as standard OpenVX graphs, on cluster-based many-core accelerators. We have developed a run-time framework which implements these techniques using a front-end compliant with the OpenVX standard, based on an OpenCL extension that enables more explicit control and efficient reuse of on-chip memory and greatly reduces recourse to off-chip memory for storing intermediate results. Experiments performed on the STHORM many-core accelerator demonstrate that our approach leads to massive reductions in execution time and memory bandwidth usage, even when the main memory bandwidth available to the accelerator is severely constrained.


Software and Compilers for Embedded Systems (SCOPES) | 2015

A framework for optimizing OpenVX applications performance on embedded manycore accelerators

Giuseppe Tagliavini; Germain Haugou; Andrea Marongiu; Luca Benini

Nowadays computer vision applications are ubiquitous, and their presence on embedded devices is increasingly widespread. Heterogeneous embedded systems featuring a clustered manycore accelerator are a very promising target for embedded vision algorithms, but code optimization for these platforms is a challenging task. Moreover, designers need support tools that are both fast and accurate. In this work we introduce ADRENALINE, an environment for the development and optimization of OpenVX applications targeting manycore accelerators. ADRENALINE consists of a custom OpenVX run-time and a virtual platform, and it is intended to help enhance the performance of embedded vision applications.

Collaboration


Dive into Giuseppe Tagliavini's collaborations.

Top Co-Authors

Igor Loi

University of Bologna
