Christian Pinto | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Christian Pinto is active.

Explore More

Publication

Featured researches published by Christian Pinto.

ieee international symposium on parallel & distributed processing, workshops and phd forum | 2013

VirtualSoC: A Full-System Simulation Environment for Massively Parallel Heterogeneous System-on-Chip

Daniele Bortolotti; Christian Pinto; Andrea Marongiu; Martino Ruggiero; Luca Benini

Driven by flexibility, performance and cost constraints of demanding modern applications, heterogeneous System-on-Chip (SoC) is the dominant design paradigm in the embedded system computing domain. SoC architecture and heterogeneity clearly provide a wider power/performance scaling, combining high performance and power efficient general-purpose cores along with massively parallel many-core-based accelerators. Besides the complex hardware, generally these kinds of platforms host also an advanced software ecosystem, composed by an operating system, several communication protocol stacks, and various computational demanding user applications. The necessity to efficiently cope with the hugeHW/SW design space provided by this scenario makes clearly full-system simulator one of the most important design tools. We present in this paper a new emulation framework, called Virtual SoC, targeting the full-system simulation of massively parallel heterogeneous SoCs.

international symposium on system-on-chip | 2011

Exploring instruction caching strategies for tightly-coupled shared-memory clusters

Daniele Bortolotti; Francesco Paterna; Christian Pinto; Andrea Marongiu; Martino Ruggiero; Luca Benini

Several Chip-Multiprocessor designs today leverage tightly-coupled computing clusters as a building block. These clusters consist of a fairly large number N of simple cores, featuring fast communication through a shared multibanked L1 data memory and ≈ 1 Instruction-Per-Cycle (IPC) per core. Thus, aggregated I-fetch bandwidth approaches ƒ * N, where ƒ is the cluster clock frequency. An effective instruction cache architecture is key to support this I-fetch bandwidth. In this paper we compare two main architectures for instruction caching targeting tightly coupled CMP clusters: (i) private instruction caches per core and (ii) shared instruction cache per cluster. We developed a cycle-accurate model of the tightly coupled cluster with several configurable architectural parameters for exploration, plus a programming environment targeted at efficient data-parallel computing. We conduct an in-depth study of the two architectural templates based on the use of both synthetic microbenchmarks and real program workloads. Our results provide useful insights and guidelines for designers.

international conference on high performance computing and simulation | 2010

Scalable instruction set simulator for thousand-core architectures running on GPGPUs

Shivani Raghav; Martino Ruggiero; David Atienza; Christian Pinto; Andrea Marongiu; Luca Benini

Simulators are still the primary tools for development and performance evaluation of applications running on massively parallel architectures. However, current virtual platforms are not able to tackle the complexity issues introduced by 1000-core future scenarios. We present a fast and accurate simulation framework targeting extremely large parallel systems by specifically taking advantage of the inherent potential processing parallelism available in modern GPGPUs.

ieee/acm international symposium cluster, cloud and grid computing | 2011

GPGPU-Accelerated Parallel and Fast Simulation of Thousand-Core Platforms

Christian Pinto; Shivani Raghav; Andrea Marongiu; Martino Ruggiero; David Atienza; Luca Benini

The multicore revolution and the ever-increasing complexity of computing systems is dramatically changing sys-tem design, analysis and programming of computing platforms. Future architectures will feature hundreds to thousands of simple processors and on-chip memories connected through a network-on-chip. Architectural simulators will remain primary tools for design space exploration, software development and performance evaluation of these massively parallel architectures. However, architectural simulation performance is a serious concern, as virtual platforms and simulation technology are not able to tackle the complexity of thousands of core future scenarios. The main contribution of this paper is the development of a new simulation approach and technology for many core processors which exploit the enormous parallel processing capability of low-cost and widely available General Purpose Graphic Processing Units (GPGPU). The simulation of many-core architectures exhibits indeed a high level of parallelism and is inherently parallelizable, but GPGPU acceleration of architectural simulation requires an in-depth revision of the data structures and functional partitioning traditionally used in parallel simulation. We demonstrate our GPGPU simulator on a target architecture composed by several cores (i.e. ARM ISA based), with instruction and data caches, connected through a Network-on-Chip (NoC). Our experiments confirm the feasibility of our approach.

IEEE Transactions on Parallel and Distributed Systems | 2015

GPU Acceleration for Simulating Massively Parallel Many-Core Platforms

Shivani Raghav; Martino Ruggiero; Andrea Marongiu; Christian Pinto; David Atienza; Luca Benini

Emerging massively parallel architectures such as a general-purpose processor plus many-core programmable accelerators are creating an increasing demand for novel methods to perform their architectural simulation. Most state-of-the-art simulation technologies are exceedingly slow and the need to model full system many-core architectures adds further to the complexity issues. This paper presents a fast, scalable and parallel simulator, which uses a novel methodology to accelerate the simulation of a many-core coprocessor using GPU platforms. The main idea is to use. The target architecture of the associated. Simulation of many target nodes is mapped to the many hardware-threads available on highly parallel GPU platforms. This paper presents a novel methodology to accelerate the simulation of many-core coprocessors using GPU platforms. We demonstrate the challenges, feasibility and benefits of our idea to use heterogeneous system (CPU and GPU) to simulate future architecture of many-core heterogeneous platforms. The target architecture selected to evaluate our methodology consists of an ARM general purpose CPU coupled with many-core coprocessor with thousands of simple in-order cores connected in a tile network. This work presents optimization techniques used to parallelize the simulation specifically for acceleration on GPUs. We partition the full system simulation between CPU and GPU, where the target general purpose CPU is simulated on the host CPU, whereas the many-core coprocessor is simulated on the NVIDIA Tesla 2070 GPU platform. Our experiments show performance of up to 50 MIPS when simulating the entire heterogeneous chip, and high scalability with increasing cores on coprocessor.

architectural support for programming languages and operating systems | 2012

Full system simulation of many-core heterogeneous SoCs using GPU and QEMU semihosting

Shivani Raghav; Andrea Marongiu; Christian Pinto; David Atienza; Martino Ruggiero; Luca Benini

Modern system-on-chips are evolving towards complex and heterogeneous platforms with general purpose processors coupled with massively parallel manycore accelerator fabrics (e.g. embedded GPUs). Platform developers are looking for efficient full-system simulators capable of simulating complex applications, middleware and operating systems on these heterogeneous targets. Unfortunately current virtual platforms are not able to tackle the complexity and heterogeneity of state-of-the-art SoCs. Software emulators, such as the open-source QEMU project, cope quite well in terms of simulation speed and functional accuracy with homogeneous coarse-grained multi-cores. The main contribution of this paper is the introduction of a novel virtual prototyping technique which exploits the heterogeneous accelerators available in commodity PCs to tackle the heterogeneity challenge in full-SoC system simulation. In a nutshell, our approach makes it possible to partition simulation between the host CPU and GPU. More specifically, QEMU runs on the host CPU and the simulation of manycore accelerators is offloaded, through semi-hosting, to the host GPU. Our experimental results confirm the flexibility and efficiency of our enhanced QEMU environment.

design, automation, and test in europe | 2015

Reducing energy consumption in microcontroller-based platforms with low design margin co-processors

Andres Gomez; Christian Pinto; Andrea Bartolini; Davide Rossi; Luca Benini; Sh Hamed Fatemi; J José Pineda de Gyvez

Advanced energy minimization techniques (i.e. DVFS, Thermal Management, etc) and their high-level HW/SW requirements are well established in high-throughput multi-core systems. These techniques would have an intolerable overhead in low-cost, performance-constrained microcontroller units (MCUs). These devices can further reduce power by operating at a lower voltage, at the cost of increased sensitivity to PVT variation and increased design margins. In this paper, we propose an runtime environment for next-generation dual-core MCU platforms. These platforms complement a single-core with a low area overhead, reduced design margin shadow-processor. The runtime decreases the overall energy consumption by exploiting design corner heterogeneity between the two cores, rather than increasing the throughput. This allows the platforms power envelope to be dynamically adjusted to application-specific requirements. Our simulations show that, depending on the ratio of core to platform energy, total energy savings can be up to 20%.

Concurrency and Computation: Practice and Experience | 2013

SIMinG‐1k: A thousand‐core simulator running on general‐purpose graphical processing units

Shivani Raghav; Andrea Marongiu; Christian Pinto; Martino Ruggiero; David Atienza; Luca Benini

This paper introduces SIMinG‐1k—a manycore simulator infrastructure. SIMinG‐1k is a graphics processing unit accelerated, parallel simulator for design‐space exploration of large‐scale manycore systems. It features an optimal trade‐off between modeling accuracy and simulation speed. Its main objectives are high performance, flexibility, and ability to simulate thousands of cores. SIMinG‐1k can model different architectures (currently, we support ARM (Available from: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0100i/index.html) and Intel x86) using two‐step approac where architecture specific front end is decoupled from a fast and parallel manycore virtual machine running on graphical processing unit platform. We evaluate the simulator for target architecture with up to 4096 cores. Our results demonstrate very high scalability and almost linear speedup with simulation of increasing number of cores.Copyright

application specific systems architectures and processors | 2013

A highly efficient, thread-safe software cache implementation for tightly-coupled multicore clusters

Christian Pinto; Luca Benini

A widely adopted design paradigm for many-core accelerators features processing elements grouped in clusters. Due to area, power and design simplicity, processors in the same clusters are often not equipped with data-caches but rather share a tightly coupled data memory (TCDM). Even if the use of a TCDM is more energy and area efficient than a cache it requires a higher programming effort because memory needs to be explicitly managed with DMA-based L3 to TCDM copies. In this context Software Caches can be used to automatically transfer data between the local TCDM and the external memory, simplifying the task of the programmer. In this paper we present an implementation of Software Cache for the STMicroelectronics STHORM many-core accelerator, featuring a L1 TCDM shared by 16 processors in a cluster. Our main contribution is the design of a fast and thread-safe cache allowing parallel access from different processing elements inside the same cluster. We evaluate our implementation with micro-benchmarks as well as a real world application from the computer vision domain. Results show that a software cache provides major performance improvements with respect to L3 allocation of large data structures even when it is aggressively shared among many parallel threads.

digital systems design | 2017

Paving the Way Towards a Highly Energy-Efficient and Highly Integrated Compute Node for the Exascale Revolution: The ExaNoDe Approach

Alvise Rigo; Christian Pinto; Kevin Pouget; Daniel Raho; Denis Dutoit; Pierre-Yves Martinez; Chris Doran; Luca Benini; Iakovos Mavroidis; Manolis Marazakis; Valeria Bartsch; Guy Lonsdale; Antoniu Pop; John Goodacre; Annaik Colliot; Paul M. Carpenter; Petar Radojković; Dirk Pleiter; Dominique Drouin; Benoît Dupont de Dinechin

Power consumption and high compute density are the key factors to be considered when building a compute node for the upcoming Exascale revolution. Current architectural design and manufacturing technologies are not able to provide the requested level of density and power efficiency to realise an operational Exascale machine. A disruptive change in the hardware design and integration process is needed in order to cope with the requirements of this forthcoming computing target. This paper presents the ExaNoDe H2020 research project aiming to design a highly energy efficient and highly integrated heterogeneous compute node targeting Exascale level computing, mixing low-power processors, heterogeneous co-processors and using advanced hardware integration technologies with the novel UNIMEM Global Address Space memory system.

Explore More