Alessandro Capotondi
University of Bologna
Publications
Featured research published by Alessandro Capotondi.
IEEE Transactions on Industrial Informatics | 2015
Andrea Marongiu; Alessandro Capotondi; Giuseppe Tagliavini; Luca Benini
Multiprocessor systems-on-chip (MPSoC) are evolving into heterogeneous architectures based on one host processor plus many-core accelerators. While heterogeneous SoCs promise higher performance/watt, programming them currently requires major code rewrites using low-level abstractions (e.g., OpenCL). We present a programming model based on OpenMP, with additional directives to program the accelerator from a single host program. As a test case, we evaluate an implementation of this programming model for the STMicroelectronics STHORM development board. We obtain near-ideal throughput for most benchmarks, performance very close to that of hand-optimized OpenCL code at significantly lower programming complexity, and up to 30× speedup versus host execution time.
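For readers unfamiliar with the single-source offload style this abstract refers to, the sketch below uses standard OpenMP 4.x target directives purely for illustration; the STHORM-specific directive set described in the paper differs in syntax and predates OpenMP 4.0, and the array size and kernel here are illustrative assumptions.

#include <stdio.h>

#define N 1024   /* illustrative problem size */

int main(void)
{
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* Offload the loop to the accelerator from the same host program; the
     * map clauses express the copy-in/copy-out data movement. */
    #pragma omp target map(to: a, b) map(from: c)
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[42] = %f\n", c[42]);
    return 0;
}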
parallel computing | 2016
Andrea Marongiu; Alessandro Capotondi; Luca Benini
Highlights: lightweight nested parallelism support for NUMA embedded many-cores is proposed; SW- and HW-accelerated solutions are explored and integrated in OpenMP; an implementation is provided on a real many-core (STMicroelectronics STHORM); we achieve up to 28× speedup vs. flat parallelism on several real embedded applications. Embedded many-core architectures are often organized as fabrics of tightly-coupled shared-memory clusters. A hierarchical interconnection system is used, with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level, which makes memory accesses non-uniform (NUMA). Due to NUMA, regular applications typically employed in the embedded domain (e.g., image processing, computer vision, etc.) ultimately behave as irregular workloads if a flat memory system is assumed at the program level. Nested parallelism represents a powerful programming abstraction for these architectures, provided that (i) streamlined middleware support is available, whose overhead does not dominate the run-time of fine-grained applications, and (ii) a mechanism to control thread binding at the cluster level is supported. We present a lightweight runtime layer for nested parallelism on cluster-based embedded many-cores, integrating our primitives into the OpenMP runtime system and implementing a new directive to control NUMA-aware nested parallelism mapping. We explore, on a set of real application use cases, how NUMA makes regular parallel workloads behave as irregular ones, and how our approach allows us to control such effects and achieve up to 28× speedup versus flat parallelism.
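The nested-parallelism pattern discussed above can be sketched with plain OpenMP as follows; the paper's NUMA-aware mapping directive is a custom extension not shown here, and the proc_bind clauses, cluster/worker counts, and kernel stub below are stand-in assumptions for illustration only.

#include <omp.h>
#include <stdio.h>

#define NUM_CLUSTERS        4    /* illustrative: outer team, one thread per cluster */
#define WORKERS_PER_CLUSTER 16   /* illustrative: inner team inside each cluster     */

static void process_tile(int cluster, int worker)
{
    /* placeholder for the per-worker portion of a regular kernel */
    printf("cluster %d, worker %d\n", cluster, worker);
}

int main(void)
{
    omp_set_max_active_levels(2);   /* allow two active levels of parallelism */

    #pragma omp parallel num_threads(NUM_CLUSTERS) proc_bind(spread)
    {
        int cluster = omp_get_thread_num();

        /* Inner team: fine-grained work stays inside one cluster, so workers
         * share the fast cluster-local memory. */
        #pragma omp parallel num_threads(WORKERS_PER_CLUSTER) proc_bind(close)
        process_tile(cluster, omp_get_thread_num());
    }
    return 0;
}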
networks on chips | 2014
Marco Balboni; Marta Ortìn Obòn; Alessandro Capotondi; Hervé Tatenguem; Alberto Ghiribaldi; Luca Ramini; Victor Viñal; Andrea Marongiu; Davide Bertozzi
There is today consensus on the fact that optical interconnects can relieve bandwidth density concerns at integrated circuit boundaries. However, when it comes to extending this emerging interconnect technology to on-chip communication as well, such consensus seems to fall apart. The main reason is a fundamental lack of compelling cases proving the superior performance and/or energy properties yielded by devices of practical interest when re-architected around a photonically-integrated communication fabric. This paper starts from the observation that many-core computing platforms are gaining momentum in the high-end embedded computing domain in the form of general-purpose programmable accelerators. Hence, the performance and energy implications of augmenting these devices with optical interconnect technology are derived by means of an accurate benchmarking framework against an aggressively optimized electrical counterpart.
software and compilers for embedded systems | 2017
Alessandro Capotondi; Andrea Marongiu
Many-core heterogeneous designs are nowadays widely available among embedded systems. Initiatives such as the HSA push for a model where the host processor and the accelerator(s) communicate via coherent, Unified Virtual Memory (UVM). In this paper we describe our experience in porting the OpenMP v4 programming model to a low-end, heterogeneous embedded system based on the PULP many-core accelerator featuring lightweight (software-managed) UVM support. We describe a GCC-based toolchain which enables: i) the automatic generation of host and accelerator binaries from a single, high-level, OpenMP parallel program; ii) the automatic instrumentation of the accelerator program to transparently manage UVM. This enables up to 4x faster execution compared to traditional copy-based offload mechanisms.
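As a rough illustration of what lightweight UVM buys at the programming-model level, the sketch below states the shared-pointer assumption with the OpenMP 5.0 requires unified_shared_memory directive; the paper's actual mechanism is a GCC/OpenMP v4 toolchain that instruments the accelerator program to manage UVM in software, which this sketch does not reproduce, and the kernel is illustrative.

/* OpenMP 5.0 way of stating the shared-pointer assumption; not the paper's
 * own mechanism (compiler-instrumented, software-managed UVM). */
#pragma omp requires unified_shared_memory

void scale_inplace(float *buf, int n, float k)
{
    /* With unified (virtual) memory the host pointer is directly valid on
     * the accelerator: the target region needs no map clauses and no bulk
     * copies, which is the kind of data-movement saving the abstract
     * attributes to UVM compared to copy-based offload. */
    #pragma omp target
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        buf[i] *= k;
}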
international conference on high performance computing and simulation | 2016
Alessandro Capotondi; Andrea Marongiu
With the introduction of more powerful and massively parallel embedded processors, embedded systems are becoming HPC-capable. Heterogeneous systems-on-chip (SoCs) that couple a general-purpose host processor to a many-core accelerator are becoming more and more widespread, and provide tremendous peak performance/watt, well suited to execute HPC-class programs. The increased computation potential is however traded off against ease of programming. Application developers are indeed required to manually outline code parts suitable for acceleration, parallelize them efficiently over the many available cores, and orchestrate data transfers to/from the accelerator. In addition, since most many-cores are organized as a collection of clusters, featuring fast local communication but slow remote communication (i.e., to another cluster's local memory), the programmer should also take care of properly mapping the parallel computation so as to avoid poor data locality. OpenMP v4.0 introduces new constructs for computation offloading, as well as directives to deploy parallel computation in a cluster-aware manner. In this paper we assess the effectiveness of OpenMP v4.0 at exploiting the massive parallelism available in embedded heterogeneous SoCs, comparing it to standard parallel loops over several computation-intensive applications from the linear algebra and image processing domains.
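A minimal sketch of the OpenMP v4.0 offload and cluster-aware constructs the paper evaluates, assuming an illustrative cluster count and a simple SAXPY kernel; how a given runtime actually maps teams onto clusters is implementation-specific and not shown.

#define NUM_CLUSTERS 4   /* illustrative cluster count */

void saxpy(int n, float a, const float *x, float *y)
{
    /* "target" offloads the region to the accelerator; "teams" creates a
     * league that a cluster-aware runtime can map one team per cluster;
     * "distribute parallel for" splits the iterations first across teams
     * (clusters) and then across the threads inside each team. */
    #pragma omp target map(to: x[0:n]) map(tofrom: y[0:n])
    #pragma omp teams distribute parallel for num_teams(NUM_CLUSTERS)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}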
2015 IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip | 2015
Alessandro Capotondi; Andrea Marongiu; Luca Benini
Current high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a general-purpose host processor is coupled to a programmable many-core accelerator (PMCA). Such PMCAs typically leverage hierarchical interconnects and distributed memory with non-uniform access (NUMA). Nested parallelism is a convenient programming abstraction for large-scale cc-NUMA systems, as it allows multiple levels of fine-grained parallelism to be created hierarchically (and dynamically) whenever it is available. Available implementations for cc-NUMA systems introduce large overheads for nested parallelism management, which cannot be tolerated given the extremely fine-grained nature of embedded parallel workloads. In particular, creating a team of parallel threads has a cost that increases linearly with the number of threads, which is inherently non-scalable. This work presents a software cache mechanism for frequently-used parallel team configurations to reduce parallel thread creation overheads on PMCA systems. When a configuration is found in the cache, parallel team creation takes constant time, providing a scalable mechanism. We evaluated our support on the STMicroelectronics STHORM many-core. Compared to the state of the art, our solution shows that: i) the cost of parallel team creation is reduced by up to 67%; ii) the tangible effect on real ultra-fine-grained parallel kernels is a speedup of up to 80%.
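The team-configuration cache idea can be sketched as follows; all names, the descriptor layout, and the direct-mapped lookup are hypothetical illustrations of the caching principle, not the runtime's actual data structures.

#include <stdint.h>
#include <stdio.h>

#define MAX_THREADS   16   /* illustrative PE count per cluster */
#define CACHE_ENTRIES 8    /* illustrative cache size           */

typedef struct {
    uint32_t thread_mask;               /* which PEs participate            */
    int      num_threads;
    int      wakeup_order[MAX_THREADS]; /* precomputed thread release order */
    int      valid;
} team_desc_t;

static team_desc_t team_cache[CACHE_ENTRIES];

/* Miss path: descriptor construction costs O(num_threads). */
static void build_team_desc(team_desc_t *d, uint32_t mask)
{
    d->thread_mask = mask;
    d->num_threads = 0;
    for (int pe = 0; pe < MAX_THREADS; pe++)
        if (mask & (1u << pe))
            d->wakeup_order[d->num_threads++] = pe;
    d->valid = 1;
}

/* Hit path: a direct-mapped lookup, constant-time regardless of team size. */
team_desc_t *get_team(uint32_t mask)
{
    team_desc_t *d = &team_cache[mask % CACHE_ENTRIES];
    if (!d->valid || d->thread_mask != mask)
        build_team_desc(d, mask);       /* miss: (re)build and keep for reuse */
    return d;
}

int main(void)
{
    team_desc_t *t = get_team(0x00FFu); /* e.g., an 8-thread team on PEs 0-7 */
    printf("cached team of %d threads\n", t->num_threads);
    return 0;
}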
Proceedings of the First International Workshop on Many-core Embedded Systems | 2013
Andrea Marongiu; Alessandro Capotondi; Giuseppe Tagliavini; Luca Benini
ieee hot chips symposium | 2015
Davide Rossi; Francesco Conti; Andrea Marongiu; Antonio Pullini; Igor Loi; Michael Gautschi; Giuseppe Tagliavini; Alessandro Capotondi; Philippe Flatresse; Luca Benini
IEEE Transactions on Emerging Topics in Computing | 2018
Alessandro Capotondi; Andrea Marongiu; Luca Benini
arXiv: Hardware Architecture | 2017
Andreas Kurth; Pirmin Vogel; Alessandro Capotondi; Andrea Marongiu; Luca Benini