
Publication


Featured research published by Paolo Mantovani.


IEEE Journal of Solid-State Circuits | 2012

A Switched-Inductor Integrated Voltage Regulator With Nonlinear Feedback and Network-on-Chip Load in 45 nm SOI

Noah Sturcken; Michele Petracca; Steve B. Warren; Paolo Mantovani; Luca P. Carloni; Angel V. Peterchev; Kenneth L. Shepard

A four-phase integrated buck converter in 45 nm silicon-on-insulator (SOI) technology is presented. The controller uses unlatched pulse-width modulation (PWM) with nonlinear gain to provide both stable small-signal dynamics and fast response (~700 ps) to large input and output transients. This fast control approach reduces the required output capacitance by 5× in comparison to a conventional, latched PWM controller at a similar operating point. The converter switches package-integrated air-core inductors at 80 MHz and delivers 1 A/mm² at 83% efficiency and a 0.66 conversion ratio. A network-on-chip (NoC) serves as a realistic digital load, along with a programmable current source capable of generating load current steps with a slew rate of ~1 A/100 ps for characterization of the control scheme.
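The headline figures above can be sanity-checked with simple power arithmetic. In this sketch only the 0.66 conversion ratio, 83% efficiency, and 1 A load come from the abstract; the 1.8 V input rail is an assumed value for illustration:

```python
# Back-of-the-envelope numbers for the converter above.
# Only the conversion ratio, efficiency, and load current are from the
# abstract; the input voltage is an assumption for illustration.
v_in = 1.8                  # assumed input voltage (V)
ratio = 0.66                # conversion ratio Vout/Vin (from abstract)
efficiency = 0.83           # converter efficiency (from abstract)
i_load = 1.0                # load current per mm^2 (A, from abstract)

v_out = v_in * ratio        # ~1.19 V output
p_out = v_out * i_load      # power delivered to the load (W)
p_in = p_out / efficiency   # power drawn from the input rail (W)
p_loss = p_in - p_out       # power dissipated in the converter (W)
print(f"Vout={v_out:.2f} V, Pout={p_out:.2f} W, loss={p_loss:.2f} W")
```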


Design Automation Conference | 2015

An Analysis of Accelerator Coupling in Heterogeneous Architectures

Emilio G. Cota; Paolo Mantovani; Giuseppe Di Guglielmo; Luca P. Carloni

Existing research on accelerators has emphasized the performance and energy efficiency improvements they can provide, devoting little attention to practical issues such as accelerator invocation and interaction with other on-chip components (e.g. cores, caches). In this paper we present a quantitative study that considers these aspects by implementing seven high-throughput accelerators following three design models: tight coupling behind a CPU, loose out-of-core coupling with Direct Memory Access (DMA) to the LLC, and loose out-of-core coupling with DMA to DRAM. A salient conclusion of our study is that working sets of non-trivial size are best served by loosely-coupled accelerators that integrate private memory blocks tailored to their needs.


IEEE Computer Architecture Letters | 2014

Accelerator Memory Reuse in the Dark Silicon Era

Emilio G. Cota; Paolo Mantovani; Michele Petracca; Mario Roberto Casu; Luca P. Carloni

Accelerators integrated on-die with General-Purpose CPUs (GP-CPUs) can yield significant performance and power improvements. Their extensive use, however, is ultimately limited by their area overhead; due to their high degree of specialization, the opportunity cost of investing die real estate in accelerators can become prohibitive, especially for general-purpose architectures. In this paper we present a novel technique aimed at mitigating this opportunity cost by allowing GP-CPU cores to reuse accelerator memory as a non-uniform cache architecture (NUCA) substrate. On a system with a last-level (L2) cache of 128 kB, our technique achieves on average a 25% performance improvement when reusing four 512 kB accelerator memory blocks to form a level-3 cache. Making these blocks reusable as NUCA slices incurs on average a 1.89% area overhead with respect to equally-sized ad hoc cache slices.


International Conference on Hardware/Software Codesign and System Synthesis | 2014

System-level memory optimization for high-level synthesis of component-based SoCs

Christian Pilato; Paolo Mantovani; Giuseppe Di Guglielmo; Luca P. Carloni

The design of specialized accelerators is essential to the success of many modern Systems-on-Chip. Electronic system-level design methodologies and high-level synthesis tools are critical for the efficient design and optimization of an accelerator. Still, these methodologies and tools offer only limited support for the optimization of the memory structures, which are often responsible for most of the area occupied by an accelerator. To address these limitations, we present a novel methodology to automatically derive the memory sub-systems of SoC accelerators. Our approach enables compositional design-space exploration and promotes design reuse of the accelerator specifications. We illustrate its effectiveness by presenting experimental results on the design of two accelerators for a high-performance embedded application.


Asia and South Pacific Design Automation Conference | 2016

High-level synthesis of accelerators in embedded scalable platforms

Paolo Mantovani; Giuseppe Di Guglielmo; Luca P. Carloni

Embedded scalable platforms combine a flexible socketed architecture for heterogeneous system-on-chip (SoC) design and a companion system-level design methodology. The architecture supports the rapid integration of processor cores with many specialized hardware accelerators. The methodology simplifies the design, integration, and programming of the heterogeneous components in the SoC. In particular, it raises the level of abstraction in the design process and guides designers in the application of high-level synthesis (HLS) tools. HLS enables a more efficient design of accelerators with a focus on their algorithmic properties, a broader exploration of their design space, and a more productive reuse across many different SoC projects.


Design Automation Conference | 2016

An FPGA-based infrastructure for fine-grained DVFS analysis in high-performance embedded systems

Paolo Mantovani; Emilio G. Cota; Kevin Tien; Christian Pilato; Giuseppe Di Guglielmo; Kenneth L. Shepard; Luca P. Carloni

Emerging technologies provide SoCs with fine-grained DVFS capabilities both in space (number of domains) and time (transients on the order of tens of nanoseconds). Analyzing these systems requires cycle-accurate accounting of rapidly changing dynamics and complex interactions among accelerators, interconnect, memory, and OS. We present an FPGA-based infrastructure that facilitates such analyses for high-performance embedded systems. We show how our infrastructure can be used to first generate SoCs with loosely-coupled accelerators, and then perform design-space exploration considering several DVFS policies under full-system workload scenarios, sweeping spatial and temporal domain granularity.
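The trade-off such a sweep exposes can be caricatured in a few lines: a DVFS policy re-evaluated at coarser intervals reacts late to load changes and wastes energy. Everything here (the utilization trace, the two frequency levels, the f² energy model) is invented for illustration and is not the paper's infrastructure:

```python
# Toy sweep of DVFS decision intervals over a synthetic utilization trace.
# Trace, frequency levels, and energy model are all invented.
FREQS = [0.5, 1.0]                # normalized frequency levels

def energy(trace, interval):
    """Re-evaluate a utilization-threshold policy every `interval` samples;
    energy per sample is modeled as utilization * f^2."""
    total, f = 0.0, 1.0
    for i, util in enumerate(trace):
        if i % interval == 0:     # DVFS decision point
            f = FREQS[1] if util > 0.5 else FREQS[0]
        total += util * f * f
    return total

trace = [0.9, 0.8, 0.2, 0.1, 0.9, 0.2, 0.1, 0.1]
for interval in (1, 2, 4):        # finer intervals track the load better
    print(interval, round(energy(trace, interval), 3))
```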


IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2017

System-Level Optimization of Accelerator Local Memory for Heterogeneous Systems-on-Chip

Christian Pilato; Paolo Mantovani; Giuseppe Di Guglielmo; Luca P. Carloni

In modern system-on-chip architectures, specialized accelerators are increasingly used to improve performance and energy efficiency. The growing complexity of these systems requires the use of system-level design methodologies featuring high-level synthesis (HLS) for generating these components efficiently. Existing HLS tools, however, have limited support for the system-level optimization of memory elements, which typically occupy most of the accelerator area. We present a complete methodology for designing the private local memories (PLMs) of multiple accelerators. Based on the memory requirements of each accelerator, our methodology automatically determines an area-efficient architecture for the PLMs to guarantee performance and reduce the memory cost based on technology-related information. We implemented a prototype tool, called Mnemosyne, that embodies our methodology within a commercial HLS flow. We designed 13 complex accelerators for selected applications from two recently-released benchmark suites (Perfect and CortexSuite). With our approach we are able to reduce the memory cost of single accelerators by up to 45%. Moreover, when reusing memory IPs across accelerators, we achieve area savings that range between 17% and 55% compared to the case where the PLMs are designed separately.
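To make the "memory cost" notion concrete, here is a toy version of one sub-problem: covering a PLM capacity requirement with macros from an SRAM library at minimum area. The macro library, its area numbers, and the greedy strategy are all hypothetical, not Mnemosyne's actual algorithm (which also handles banking for parallel accesses and cross-accelerator reuse):

```python
# Hypothetical SRAM macro library: capacity (kB) -> area (arbitrary units).
# Larger macros are cheaper per kB, as is typical of compiled SRAMs.
MACROS = {1: 10, 2: 18, 4: 32, 8: 56}

def compose_plm(required_kb):
    """Cover a capacity requirement greedily with the largest macros first;
    the 1 kB macro guarantees the requirement is met exactly."""
    chosen, remaining = [], required_kb
    for size in sorted(MACROS, reverse=True):
        while remaining >= size:
            chosen.append(size)
            remaining -= size
    area = sum(MACROS[s] for s in chosen)
    return chosen, area

print(compose_plm(13))   # a 13 kB PLM -> macros [8, 4, 1], area 98
```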


ACM Transactions on Embedded Computing Systems | 2017

COSMOS: Coordination of High-Level Synthesis and Memory Optimization for Hardware Accelerators

Luca Piccolboni; Paolo Mantovani; Giuseppe Di Guglielmo; Luca P. Carloni

Hardware accelerators are key to the efficiency and performance of system-on-chip (SoC) architectures. With high-level synthesis (HLS), designers can easily obtain several performance-cost trade-off implementations for each component of a complex hardware accelerator. However, navigating this design space in search of the Pareto-optimal implementations at the system level is a hard optimization task. We present COSMOS, an automatic methodology for the design-space exploration (DSE) of complex accelerators, that coordinates both HLS and memory optimization tools in a compositional way. First, thanks to the co-design of datapath and memory, COSMOS produces a large set of Pareto-optimal implementations for each component of the accelerator. Then, COSMOS leverages compositional design techniques to quickly converge to the desired trade-off point between cost and performance at the system level. When applied to the system-level design (SLD) of an accelerator for wide-area motion imagery (WAMI), COSMOS explores the design space as completely as an exhaustive search, but it reduces the number of invocations to the HLS tool by up to 14.6×.
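The component-level pruning described above reduces to keeping only Pareto-optimal (cost, performance) points. A minimal sketch of that step, with invented (area, latency) pairs standing in for HLS implementations:

```python
def pareto_front(points):
    """Return the (cost, latency) pairs not dominated by any other point.
    q dominates p when q is no worse in both dimensions and, being a
    distinct point, strictly better in at least one."""
    return sorted(p for p in points
                  if not any(q != p and q[0] <= p[0] and q[1] <= p[1]
                             for q in points))

# Invented HLS implementations of one component: (area, latency).
impls = [(4, 10), (6, 7), (5, 11), (9, 5), (7, 7), (12, 4)]
print(pareto_front(impls))  # (5, 11) and (7, 7) are dominated and dropped
```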


Integration | 2015

A synchronous latency-insensitive RISC for better than worst-case design

Mario Roberto Casu; Paolo Mantovani

Variability of process parameters in nanometer CMOS circuits makes the standard worst-case design methodology waste much of the advantage of scaling. A common-case design, though, is a perilous alternative, as it gives up much of the design yield. Better-than-worst-case (BTWC) design methodology reconciles performance and yield. In this paper we present a BTWC RISC processor that tolerates worst-case extra delays of critical paths without significant impact on the overall performance. We obtain this result by coupling latency-insensitive design and variable-latency (VL) units. A software built-in self-test checks VL units individually to determine whether to activate them. Compared to a worst-case approach, the RISC clock frequency increases by 23% in a 45 nm CMOS technology. The impact of VL on instructions per cycle is limited to the worst process corner and very small, as we show through a set of benchmarks.

Highlights:
- Our BTWC processor tolerates worst-case delays without impact on performance thanks to latency-insensitive design (LID) and variable-latency (VL) units.
- A fine-grain LID pipeline interlock handles stalls caused by control and data hazards, late memory access, and VL execution.
- Clock frequency increases by 23% in a 45 nm CMOS technology compared to the worst-case approach.
- Variable-latency penalties occur only in the worst process corner; no performance degradation ensues in common and best-case corners.
- A lightweight software built-in self-test (BIST) assigns two-cycle execution of critical instructions only when needed.


Great Lakes Symposium on VLSI | 2011

Coupling latency-insensitivity with variable-latency for better than worst case design: a RISC case study

Mario Roberto Casu; Stefano Colazzo; Paolo Mantovani

The gap between worst- and typical-case delays is bound to increase in nanometer-scale technologies due to the spread in process manufacturing parameters. To still profit from scaling, designs should tolerate worst-case delays seamlessly and with minimal performance degradation with respect to the typical case. We present a simple RISC core that tolerates worst-case extra latency by coupling the latency-insensitive design approach with a variable-latency mechanism. Stalls caused by excessive delay, by data and control hazards, and by late memory access are dealt with in a uniform way. Compared to a pure worst-case approach, our design method increases the core clock frequency by 23% in a 45 nm CMOS technology, without area or power penalty.
