Matteo Monchiero | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Matteo Monchiero is active.

Explore More

Publication

Featured researches published by Matteo Monchiero.

IEEE Computer Architecture Letters | 2009

Power Management of Datacenter Workloads Using Per-Core Power Gating

Jacob Leverich; Matteo Monchiero; Vanish Talwar; Parthasarathy Ranganathan; Christos Kozyrakis

While modern processors offer a wide spectrum of software-controlled power modes, most datacenters only rely on dynamic voltage and frequency scaling (DVFS, a.k.a. P-states) to achieve energy efficiency. This paper argues that, in the case of datacenter workloads, DVFS is not the only option for processor power management. We make the case for per-core power gating (PCPG) as an additional power management knob for multi-core processors. PCPG is the ability to cut the voltage supply to selected cores, thus reducing to almost zero the leakage power for the gated cores. Using a testbed based on a commercial 4-core chip and a set of real-world application traces from enterprise environments, we have evaluated the potential of PCPG. We show that PCPG can significantly reduce a processors energy consumption (up to 40%) without significant performance overheads. When compared to DVFS, PCPG is highly effective saving up to 30% more energy than DVFS. When DVFS and PCPG operate together they can save up to almost 60%.

international conference on information technology coding and computing | 2005

AES power attack based on induced cache miss and countermeasure

Guido Bertoni; Vittorio Zaccaria; Luca Breveglieri; Matteo Monchiero; Gianluca Palermo

This paper presents a new attack against a software implementation of the Advanced Encryption Standard. The attack aims at flushing elements of the SBOX from the cache, thus inducing a cache miss during the encryption phase. The power trace is then used to detect when the cache miss occurs; if the miss happens in the first round of the AES then the information can be used to recover part of the secret key. The attack has been simulated using the Wattch simulation framework and a simple software implementation of AES (using a single table for the SBOX). The attack can be easily extended to more sophisticated versions of AES with more than one table. Eventually, we present a simple countermeasure which does not require randomization.

international conference on supercomputing | 2006

Design space exploration for multicore architectures: a power/performance/thermal view

Matteo Monchiero; Ramon Canal; Antonio González

Multicore architectures are ruling the recent microprocessor design trend. This is due to different reasons: better performance, thread-level parallelism bounds in modern applications, ILP diminishing returns, better thermal/power scaling (many small cores dissipate less than a large and complex one); and, ease and reuse of design.This paper presents a thorough evaluation of multicore architectures. The architecture we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1 and L2, and a shared bus interconnect. We consider parallel shared memory applications. We explore the design space related to the number of cores, L2 cache size and processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption and temperature. Design tradeoffs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed and they show the importance of considering thermal effects at the architectural level to achieve the best design choice.

ACM Sigarch Computer Architecture News | 2009

How to simulate 1000 cores

Matteo Monchiero; Jung Ho Ahn; Ayose Falcón; Daniel Ortega; Paolo Faraboschi

This paper proposes a novel methodology to efficiently simulate shared-memory multiprocessors composed of hundreds of cores. The basic idea is to use thread-level parallelism in the software system and translate it into corelevel parallelism in the simulated world. To achieve this, we first augment an existing full-system simulator to identify and separate the instruction streams belonging to the different software threads. Then, the simulator dynamically maps each instruction flow to the corresponding core of the target multi-core architecture, taking into account the inherent thread synchronization of the running applications. Our simulator allows a user to execute any multithreaded application in a conventional full-system simulator and evaluate the performance of the application on a many-core hardware. We carried out extensive simulations on the SPLASH-2 benchmark suite and demonstrated the scalability up to 1024 cores with limited simulation speed degradation vs. the single-core case on a fixed workload. The results also show that the proposed technique captures the intrinsic behavior of the SPLASH-2 suite, even when we scale up the number of shared-memory cores beyond the thousand-core limit.

IEEE Transactions on Parallel and Distributed Systems | 2008

Power/Performance/Thermal Design-Space Exploration for Multicore Architectures

Matteo Monchiero; Ramon Canal; Antonio González

Multicore architectures have been ruling the recent microprocessor design trend. This is due to different reasons: better performance, thread-level parallelism bounds in modern applications, ILP diminishing returns, better thermal/power scaling (many small cores dissipate less than a large and complex one), and the ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture that we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1, shared/private L2, and a shared bus interconnect. We consider a benchmark set composed of several parallel shared memory applications. We explore the design space related to the number of cores, L2 cache size, and processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption, and temperature. Design trade-offs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, showing the importance of considering thermal effects at the architectural level to achieve the best design choice.

IEEE Transactions on Very Large Scale Integration Systems | 2006

Efficient Synchronization for Embedded On-Chip Multiprocessors

Matteo Monchiero; Gianluca Palermo; Cristina Silvano; Oreste Villa

This paper investigates optimized synchronization techniques for shared memory on-chip multiprocessors (CMPs) based on network-on-chip (NoC) and targeted at future mobile systems. The proposed solution is based on the idea of locally performing synchronization operations requiring continuous polling of a shared variable, thus, featuring large contentions (e.g., spin locks and barriers). A hardware (HW) module, the synchronization-operation buffer (SB), has been introduced to queue and to manage the requests issued by the processors. By using this mechanism, we propose a spin lock implementation requiring a constant number of network transactions and memory accesses per lock acquisition. The SB also supports an efficient implementation of barriers. Experimental validation has been carried out by using GRAPES, a cycle-accurate performance/power simulation platform for multiprocessor systems-on-chip (MPSoCs). Two different architectures have been explored to prove that the proposed approach is effective independently from caches and coherence schemes adopted. For an eight-processor target architecture, we show that the SB-based solution achieves up to 50% performance improvement and 30% energy saving with respect to synchronization based on the caching of the synchronization variables and directory-based coherence protocol. Furthermore, we prove the scalability of the proposed approach when the number of processors increases

Journal of Systems Architecture | 2007

Exploration of distributed shared memory architectures for NoC-based multiprocessors

Matteo Monchiero; Gianluca Palermo; Cristina Silvano; Oreste Villa

Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoCs), have been recently proposed. The shared memory represents one of the key elements in designing MP-SoCs to provide data exchange and synchronization support. This paper focuses on the energy/delay exploration of a distributed shared memory architecture, suitable for low-power on-chip multiprocessors based on NoC. A mechanism is proposed for the data allocation on the distributed shared memory space, dynamically managed by an on-chip hardware memory management unit (HwMMU). Moreover, the exploitation of the HwMMU primitives for the migration, replication, and compaction of shared data is discussed. Experimental results show the impact of different distributed shared memory configurations for a selected set of parallel benchmark applications from the power/-performance perspective. Furthermore, a case study for a graph exploration algorithm is discussed, accounting for the effects of the core mapping and the network topology on energy and performance at the system level.

ieee computer society annual symposium on vlsi | 2007

A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs

Antonino Tumeo; Matteo Monchiero; Gianluca Palermo; Fabrizio Ferrandi; Donatella Sciuto

Multimedia applications, and in particular the encoding and decoding of standard image and video formats, are usually a typical target for systems-on-chip (SoC). The bi-dimensional discrete cosine transformation (2D-DCT) is a commonly used frequency transformation in graphic compression algorithms. Many hardware implementations, adopting disparate algorithms, have been proposed for field programmable gate arrays (FPGA). These designs focus either on performance or area, and often do not succeed in balancing the two aspects. In this paper, we present a design of a fast 2D-DCT hardware accelerator for a FPGA-based SoC. This accelerator makes use of a single seven stages 1D-DCT pipeline able to alternate computation for the even and odd coefficients in every cycle. In addition, it uses special memories to perform the transpose operations. Our hardware takes 80 clock cycles at 107MHz to generate a complete 8times8 2D DCT, from the writing of the first input sample to the reading of the last result (including the overhead of the interface logic). We show that this architecture provides optimal performance/area ratio with respect to several alternative designs.

great lakes symposium on vlsi | 2007

A design kit for a fully working shared memory multiprocessor on FPGA

Antonino Tumeo; Matteo Monchiero; Gianluca Palermo; Fabrizio Ferrandi; Donatella Sciuto

This paper presents a framework to design a shared memory multiprocessor on a programmable platform. We propose a complete flow, composed by a programming model and a template architecture. Our framework permits to write a parallel application by using a shared memory model. It deals with the consistency of shared data, with no need of hardware coherence protocol, but uses a software model to properlyallsynchronize the local copies with the shared memory image. This idea can be applied both to a scratchpad-based architecture or a cache-based one. The architecture is synthesizable with standard IPs, such as the softcores and interconnect elements, which may be found in any commercial FPGA toolset.

international conference on embedded computer systems: architectures, modeling, and simulation | 2006

Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors

Matteo Monchiero; Gianluca Palermo; Cristina Silvano; Oreste Villa

Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoC), have been recently proposed. The shared memory represents one of the key elements in designing MP-SoCs, since its function is to provide data exchange and synchronization support. In this paper, a distributed shared memory architecture has been explored, that is suitable for low-power on-chip multiprocessors based on NoC. In particular, the paper focuses on the energy/delay exploration of on-chip physically distributed and logically shared memory address space for MP-SoCs based on a parameterizable NoC. The data allocation on the physically distributed shared memory space is dynamically managed by an on-chip hardware memory management unit. Experimental results show the impact of different NoC topologies and distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective

Explore More