Paolo Burgio | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Paolo Burgio is active.

Explore More

Publication

Featured researches published by Paolo Burgio.

design, automation, and test in europe | 2012

Fast and lightweight support for nested parallelism on cluster-based embedded many-cores

Andrea Marongiu; Paolo Burgio; Luca Benini

Several recent many-core accelerators have been architected as fabrics of tightly-coupled shared memory clusters. A hierarchical interconnection system is used - with a crossbar-like medium inside each cluster and a network-on-chip (NoC) at the global level - which make memory operations non-uniform (NUMA). Nested parallelism represents a powerful programming abstraction for these architectures, where a first level of parallelism can be used to distribute coarse-grained tasks to clusters, and additional levels of fine-grained parallelism can be distributed to processors within a cluster. This paper presents a lightweight and highly optimized support for nested parallelism on cluster-based embedded many-cores. We assess the costs to enable multi-level parallelization and demonstrate that our techniques allow to extract high degrees of parallelism.

Microprocessors and Microsystems | 2011

Supporting OpenMP on a multi-cluster embedded MPSoC

Andrea Marongiu; Paolo Burgio; Luca Benini

The ever-increasing complexity of MPSoCs is putting the production of software on the critical path in embedded system development. Several programming models and tools have been proposed in the recent past that aim to facilitate application development for embedded MPSoCs. OpenMP is a mature and easy-to-use standard for shared memory programming, which has recently been successfully adopted in embedded MPSoC programming as well. To achieve performance, however, it is necessary that the implementation of OpenMP constructs efficiently exploits the many peculiarities of MPSoC hardware, and that custom features are provided to the programmer to control it. In this paper we consider a representative template of a modern multi-cluster embedded MPSoC and present an extensive evaluation of the cost associated with supporting OpenMP on such a machine, investigating several implementation variants that are aware of the memory hierarchy and of the heterogeneous interconnection.

design, automation, and test in europe | 2013

Variation-tolerant OpenMP tasking on tightly-coupled processor clusters

Abbas Rahimi; Andrea Marongiu; Paolo Burgio; Rajesh K. Gupta; Luca Benini

We present a variation-tolerant tasking technique for tightly-coupled shared memory processor clusters that relies upon modeling advance across the hardware/software interface. This is implemented as an extension to the OpenMP 3.0 tasking programming model. Using the notion of Task-Level Vulnerability (TLV) proposed here, we capture dynamic variations caused by circuit-level variability as a high-level software knowledge. This is accomplished through a variation-aware hardware/software codesign where: (i) Hardware features variability monitors in conjunction with online per-core characterization of TLV metadata; (ii) Software supports a Task-level Errant Instruction Management (TEIM) technique to utilize TLV metadata in the runtime OpenMP task scheduler. This method greatly reduces the number of recovery cycles compared to the baseline scheduler of OpenMP [22], consequently instruction per cycle (IPC) of a 16-core processor cluster is increased up to 1.51× (1.17× on average). We evaluate the effectiveness of our approach with various number of cores (4,8,12,16), and across a wide temperature range(ΔT=90°C).

digital systems design | 2012

OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters

Paolo Burgio; Andrea Marongiu; Dominique Heller; Cyrille Chavet; Philippe Coussy; Luca Benini

Modern embedded MPSoC designs increasingly couple hardware accelerators to processing cores to trade between energy efficiency and platform specialization. To assist effective design of such systems there is the need on one hand for clear methodologies to streamline accelerator definition and instantiation, on the other for architectural templates and run-time techniques that minimize processors-to-accelerator communication costs. In this paper we present an architecture featuring tightly-coupled processors and accelerators, with zero-copy communication. Efficient programming is supported by an extended OpenMP programming model, where custom directives allow to specialize code regions for execution on parallel cores, accelerators, or a mix of the two. Our integrated approach enables fast yet accurate exploration of accelerator-based HW and SW architectures.

design, automation, and test in europe | 2013

Enabling fine-grained OpenMP tasking on tightly-coupled shared memory clusters

Paolo Burgio; Giuseppe Tagliavini; Andrea Marongiu; Luca Benini

Cluster-based architectures are increasingly being adopted to design embedded many-cores. These platforms can deliver very high peak performance within a contained power envelope, provided that programmers can make effective use the available parallel cores. This is becoming an extremely difficult task, as embedded applications are growing in complexity and exhibit irregular and dynamic parallelism. The OpenMP tasking extensions represent a powerful abstraction to capture this form of parallelism. However, efficiently supporting it on cluster-based embedded SoCs is not easy, because the fine-grained parallel workload present in embedded applications can not tolerate high memory and run-time overheads. In this paper we present our design of the runtime support layer to OpenMP tasking for an embedded shared memory cluster, identifying key aspects to achieving performance and discussing important architectural support to removing major bottlenecks.

real time technology and applications symposium | 2011

Bus Access Design for Combined Worst and Average Case Execution Time Optimization of Predictable Real-Time Applications on Multiprocessor Systems-on-Chip

Jakob Rosén; Carl-Fredrik Neikter; Petru Eles; Zebo Peng; Paolo Burgio; Luca Benini

Optimization techniques for improving the average-case execution time of an application, for which predictability with respect to time is not required, have been investigated for a long time in many different contexts. However, this has traditionally been done without paying attention to the worst-case execution time. For predictable real-time applications, on the other hand, the focus has been solely on worst-case execution time optimization, ignoring how this affects the execution time in the average case. In this paper, we show that having a good average-case delay can be important also for real-time applications for which predictability is required. Furthermore, for real-time applications running on multiprocessor systems-on-chip, we present a technique for optimizing the average case and the worst case simultaneously, allowing for a good average-case execution time while still keeping the worst case as small as possible.

computing frontiers | 2011

MPOpt-Cell: a high-performance data-flow programming environment for the CELL BE processor

Alessio Franceschelli; Paolo Burgio; Giuseppe Tagliavini; Andrea Marongiu; Martino Ruggiero; Michele Lombardi; Alessio Bonfietti; Michela Milano; Luca Benini

We present MPOpt-Cell, an architecture-aware framework for high-productivity development and efficient execution of stream applications on the CELL BE Processor. It enables developers to quickly build Synchronous Data Flow (SDF) applications using a simple and intuitive programming interface based on a set of compiler directives that capture the key abstractions of SDF. The compiler backend and system runtime efficiently manage hardware resources.

Real-Time and Embedded Systems and Technologies (RTEST), 2015 CSI Symposium on | 2015

A memory-centric approach to enable timing-predictability within embedded many-core accelerators

Paolo Burgio; Andrea Marongiu; Paolo Valente; Marko Bertogna

There is an increasing interest among real-time systems architects for multi- and many-core accelerated platforms. The main obstacle towards the adoption of such devices within industrial settings is related to the difficulties in tightly estimating the multiple interferences that may arise among the parallel components of the system. This in particular concerns concurrent accesses to shared memory and communication resources. Existing worst-case execution time analyses are extremely pessimistic, especially when adopted for systems composed of hundreds-tothousands of cores. This significantly limits the potential for the adoption of these platforms in real-time systems. In this paper, we study how the predictable execution model (PREM), a memory-aware approach to enable timing-predictability in realtime systems, can be successfully adopted on multi- and manycore heterogeneous platforms. Using a state-of-the-art multi-core platform as a testbed, we validate that it is possible to obtain an order-of-magnitude improvement in the WCET bounds of parallel applications, if data movements are adequately orchestrated in accordance with PREM. We identify which system parameters mostly affect the tremendous performance opportunities offered by this approach, both on average and in the worst case, moving the first step towards predictable many-core systems.

international symposium on parallel architectures algorithms and programming | 2015

Efficient Implementation of Genetic Algorithms on GP-GPU with Scheduled Persistent CUDA Threads

Nicola Capodieci; Paolo Burgio

In this paper we present a heavily exploration oriented implementation of genetic algorithms to be executed on graphic processor units (GPUs) that is optimized with our novel mechanism for scheduling GPU-side synchronized jobs that takes inspiration from the concept of persistent threads. Persistent Threads allow an efficient distribution of work loads throughout the GPU so to fully exploit the CUDA (NVIDIAs proprietary Compute Unified Device Architecture) architecture. Our approach (named Scheduled Light Kernel, SLK) uses a specifically designed data structure for issuing sequences of commands from the CPU to the GPU able to minimize CPUGPU communications, exploit streams of concurrent execution of different device side functions within different Streaming Multiprocessors and minimize kernels launch overhead. Results obtained on two completely different experimental settings show that our approach is able to dramatically increase the performance of the tested genetic algorithms compared to the baseline implementation that (while still running on a GPU) does not exploit our proposed approach. Our proposed SLK approach does not require substantial code rewriting and is also compared to newly introduced features in the last CUDA development toolkit, such as nested kernel invocations for dynamic parallelism.

Microprocessors and Microsystems | 2015

P-SOCRATES

Luis Miguel Pinho; Vincent Nélis; Patrick Meumeu Yomsi; Eduardo Quiñones; Marko Bertogna; Paolo Burgio; Andrea Marongiu; Claudio Scordino; Paolo Gai; Michele Ramponi; Michal M Mardiak

The advent of next-generation many-core embedded platforms has the chance of intercepting a converging need for predictable high-performance coming from both the High-Performance Computing (HPC) and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being required by markets needing huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real-time, challenging the performance capabilities of current architectures. This converging demand, however, raises the problem about how to guarantee timing requirements in presence of parallel execution. This paper presents the approach of project P-SOCRATES for the design of an integrated framework for the execution of workload-intensive applications with real-time requirements on top of next-generation commercial-off-the-shelf (COTS) platforms based on many-core accelerated architectures. The time-criticality and parallelisation challenges are addressed by merging techniques coming from both HPC and EC domains, identifying the main sources of indeterminism and proposing efficient mapping and scheduling algorithms, along with the associated timing and schedulability analysis, to guarantee the real-time and performance requirements of the applications.

Explore More