Alberto Scionti
University of Siena
Publication
Featured research published by Alberto Scionti.
Future Generation Computer Systems | 2015
Roberto Giorgi; Alberto Scionti
Large synchronization and communication overheads will become a major concern in future extreme-scale machines (e.g., HPC systems, supercomputers). These systems will push performance limits upwards by adopting chips equipped with one order of magnitude more cores than today's. Alternative execution models can be explored to exploit the high parallelism offered by future massive many-core chips. This paper proposes the integration of standard cores with dedicated co-processing units that enable the system to support a fine-grain data-flow execution model developed within the TERAFLUX project. An instruction set architecture extension supporting fine-grain thread scheduling and execution is proposed. This extension is supported by the co-processor, which provides hardware units for accelerating thread scheduling and distribution among the available cores. Two fundamental aspects underpin the proposed system: programmers can adopt their preferred programming model, and the compilation tools can produce a large set of threads communicating mainly in a producer-consumer fashion, hence enabling data-flow execution. Experimental results demonstrate the feasibility of the proposed approach and its capability of scaling with an increasing number of cores. We present a data-flow based co-processor supporting the execution of fine-grain threads. We propose a minimalistic core ISA extension for data-flow threads. We propose a two-level hierarchical scheduling co-processor that implements the ISA extension. We show the scalability of the proposed system through a set of experimental results.
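To make the producer-consumer data-flow model concrete, here is a minimal C sketch that emulates a fine-grain thread guarded by a synchronization count: the thread fires only once all of its inputs have been written. The names (df_schedule, df_write, adder) are illustrative stand-ins, not the paper's actual ISA mnemonics.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical emulation of a fine-grain data-flow thread:
 * a thread becomes ready only when its synchronization count
 * (number of pending inputs) reaches zero. */
typedef struct {
    void (*body)(int64_t *frame); /* thread code             */
    int64_t frame[4];             /* producer-written inputs */
    int     sync_count;           /* pending inputs          */
} df_thread;

/* Create a thread that waits for `inputs` values. */
static df_thread df_schedule(void (*body)(int64_t *), int inputs) {
    df_thread t = { .body = body, .sync_count = inputs };
    return t;
}

/* Producer writes one input into the consumer's frame;
 * when the count hits zero the thread fires. */
static void df_write(df_thread *t, int slot, int64_t value) {
    t->frame[slot] = value;
    if (--t->sync_count == 0)
        t->body(t->frame);        /* data-driven execution */
}

static void adder(int64_t *frame) {
    printf("sum = %lld\n", (long long)(frame[0] + frame[1]));
}

int main(void) {
    df_thread t = df_schedule(adder, 2); /* needs two inputs       */
    df_write(&t, 0, 40);                 /* first producer         */
    df_write(&t, 1, 2);                  /* second producer: fires */
    return 0;
}
```

In the proposed system the equivalent of df_schedule and df_write would be single instructions handled by the scheduling co-processor, rather than software calls.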
international conference on artificial intelligence | 2014
Nam Ho; Antoni Portero; Marco Solinas; Alberto Scionti; Andrea Mondelli; Paolo Faraboschi; Roberto Giorgi
The trend towards increasingly intelligent systems leads directly to a considerable demand for more and more computational power. Programming models that help exploit application parallelism on current multi-core systems exist, but with limitations. From this perspective, new execution models are arising to overcome the limits on scaling up the number of processing elements, while dedicated hardware can help with the scheduling of threads in many-core systems. This paper depicts a data-flow based execution model that exposes up to millions of fine-grain threads to the multi-core x86-64 architecture. We propose to augment the existing architecture with a hardware thread scheduling unit. The functionality of this unit is exposed by means of four dedicated instructions. Results with a pure data-flow application (i.e., recursive Fibonacci) show that the hardware scheduling unit can load the computing cores (up to 32 in our tests) more efficiently than run-time managed threads generated by programming models (e.g., OpenMP and Cilk). Furthermore, our solution shows better scaling and less saturation as the number of workers increases.
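For reference, the task-parallel recursive Fibonacci baseline that hardware schedulers of this kind are compared against typically looks like the OpenMP sketch below; the structure is illustrative and not taken from the paper. Each recursive call spawns run-time managed tasks, and with millions of fine-grain tasks the software scheduler itself becomes the bottleneck that a hardware scheduling unit is designed to remove.

```c
#include <stdint.h>
#include <stdio.h>

/* Task-parallel recursive Fibonacci: every call spawns two
 * tasks managed by the OpenMP runtime, then joins them. */
static int64_t fib(int n) {
    if (n < 2) return n;
    int64_t a, b;
    #pragma omp task shared(a)
    a = fib(n - 1);
    #pragma omp task shared(b)
    b = fib(n - 2);
    #pragma omp taskwait          /* join both child tasks */
    return a + b;
}

int main(void) {
    int64_t r;
    #pragma omp parallel
    #pragma omp single            /* one task tree, many workers */
    r = fib(30);
    printf("fib(30) = %lld\n", (long long)r);
    return 0;
}
```

Compile with -fopenmp; without it the pragmas are ignored and the program runs sequentially.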
computing frontiers | 2015
Nam Ho; Andrea Mondelli; Alberto Scionti; Marco Solinas; Antoni Portero; Roberto Giorgi
Future exascale machines will require multi-/many-core architectures able to efficiently run multi-threaded applications. Data-flow execution models have been demonstrated to improve execution performance by limiting the synchronization overhead. This paper proposes augmenting cores with a minimalistic set of hardware units and dedicated instructions that allow the execution of threads to be scheduled efficiently on the basis of data-flow principles. Experimental results show performance improvements of the system when compared with other techniques (e.g., OpenMP, Cilk).
digital systems design | 2015
Andrea Mondelli; Nam Ho; Alberto Scionti; Marco Solinas; Antoni Portero; Roberto Giorgi
The path towards future high-performance computers requires architectures able to efficiently run multi-threaded applications. In this context, dataflow-based execution models can improve performance by limiting the synchronization overhead, thanks to a simple producer-consumer approach. This paper advocates an instruction set extension (ISE) of standard cores with a small hardware extension for efficiently scheduling the execution of threads on the basis of dataflow principles. A set of dedicated instructions allows the code to interact with the scheduler. Experimental results demonstrate that the combination of dedicated scheduling units and a dataflow execution model improves performance when compared with other techniques for code parallelization (e.g., OpenMP, Cilk).
mediterranean conference on embedded computing | 2014
Alberto Scionti; Stamatis Kavvadias; Roberto Giorgi
Discovering the most appropriate reconfiguration instants for improving performance and lowering power consumption is not a trivial problem. In this paper we show the benefits, in terms of performance gain and power reduction, of dynamically adapting an embedded platform (e.g., its cache size, clock frequency, and core issue-width), through a design space exploration campaign focusing on a relevant case study. To this end, we analyze a set of benchmarks from the embedded application domain with the aim of illustrating how the appropriate selection of reconfiguration instants can positively influence system performance and power consumption. Experimental results using the cjpeg benchmark show that power consumption can be reduced by an average of 22%. Our methodology can be used to create a set of run-time management policies for driving the adaptation process.
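A minimal sketch of what a run-time management policy of this kind could look like: at each candidate reconfiguration instant, sampled metrics select one of a few platform configurations. The configurations, thresholds, and metrics below are invented for illustration and are not the paper's actual policy.

```c
#include <stdio.h>

/* Illustrative adaptation knobs: cache size, clock, issue width. */
typedef struct {
    int cache_kb;
    int freq_mhz;
    int issue_width;
} config_t;

static const config_t LOW_POWER = {  8,  400, 1 };
static const config_t BALANCED  = { 16,  800, 2 };
static const config_t HIGH_PERF = { 32, 1200, 4 };

/* Choose a configuration from a sampled miss rate and IPC. */
static config_t select_config(double miss_rate, double ipc) {
    if (ipc < 0.5)
        return LOW_POWER;   /* stalled phase: scaling up wastes power */
    if (miss_rate > 0.10)
        return BALANCED;    /* bigger cache, modest clock             */
    return HIGH_PERF;       /* compute-bound: go fast                 */
}

int main(void) {
    /* Fake samples at three reconfiguration instants. */
    double samples[][2] = { {0.02, 1.8}, {0.15, 0.9}, {0.05, 0.3} };
    for (int i = 0; i < 3; i++) {
        config_t c = select_config(samples[i][0], samples[i][1]);
        printf("instant %d -> %d KB, %d MHz, width %d\n",
               i, c.cache_kb, c.freq_mhz, c.issue_width);
    }
    return 0;
}
```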
Sensors | 2018
Alberto Scionti; Somnath Mazumdar; Antoni Portero
The rapid evolution of Cloud-based services and the growing interest in deep learning (DL)-based applications is putting increasing pressure on hyperscalers and general-purpose hardware designers to provide more efficient and scalable systems. Cloud-based infrastructures must consist of more energy efficient components. The evolution must take place from the core of the infrastructure (i.e., data centers (DCs)) to the edges (Edge computing) to adequately support new/future applications. Adaptability/elasticity is one of the features required to increase the performance-to-power ratio. Hardware-based mechanisms have been proposed to support system reconfiguration mostly at the processing element level, while fewer studies have been carried out regarding scalable, modular interconnected sub-systems. In this paper, we propose a scalable Software Defined Network-on-Chip (SDNoC)-based architecture. Our solution can easily be adapted to support devices ranging from low-power computing nodes placed at the edge of the Cloud to high-performance many-core processors in the Cloud DCs, by leveraging a modular design approach. The proposed design merges the benefits of hierarchical network-on-chip (NoC) topologies (by fusing the ring and the 2D-mesh topology) with those brought by dynamic reconfiguration (i.e., adaptation). Our proposed interconnect allows different types of virtualised topologies to be created, aiming to serve different communication requirements and thus provide better resource partitioning (virtual tiles) for concurrent tasks. To allow the software layer to control and monitor the NoC subsystem, a few customised instructions supporting a data-driven program execution model (PXM) are added to the processing element's instruction set architecture (ISA). In general, data-driven programming and execution models are well suited to supporting DL applications. We also introduce a mechanism to map a high-level programming language embedding concurrent execution models onto the basic functionalities offered by our SDNoC, easing the programming of the proposed system. In the reported experiments, we compared our lightweight reconfigurable architecture to a conventional flattened 2D-mesh interconnection subsystem. Results show that our design provides a 9.5% increase in data traffic throughput and a 2.2× reduction in average packet latency, compared to the flattened 2D-mesh topology connecting the same number of processing elements (PEs) (up to 1024 cores). Power and resource consumption (on FPGA devices) are also low, confirming the good scalability of the proposed architecture.
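A hedged sketch of how software might carve virtual tiles out of such an SDNoC: a rectangle of PEs in the 2D mesh is assigned a virtual topology suited to its task's communication pattern. The types and the sdnoc_map_vtile call are hypothetical stand-ins for the customised ISA instructions mentioned in the abstract, which would program per-router tables in hardware.

```c
#include <stdint.h>
#include <stdio.h>

/* Virtual topologies the abstract describes: ring and 2D mesh. */
typedef enum { TOPO_RING, TOPO_MESH } topo_t;

typedef struct {
    uint8_t x0, y0, x1, y1;  /* rectangle of PEs in the fabric    */
    topo_t  topology;        /* virtual topology inside the tile  */
    uint8_t vtile_id;
} vtile_t;

/* Stand-in for a privileged "configure NoC" instruction. */
static int sdnoc_map_vtile(const vtile_t *v) {
    int pes = (v->x1 - v->x0 + 1) * (v->y1 - v->y0 + 1);
    printf("vtile %u: %d PEs as %s\n", v->vtile_id, pes,
           v->topology == TOPO_RING ? "ring" : "2D mesh");
    return 0;
}

int main(void) {
    /* Partition a 32x32-PE fabric into two virtual tiles. */
    vtile_t a = {  0, 0, 15, 31, TOPO_MESH, 0 };  /* compute task   */
    vtile_t b = { 16, 0, 31, 31, TOPO_RING, 1 };  /* streaming task */
    sdnoc_map_vtile(&a);
    sdnoc_map_vtile(&b);
    return 0;
}
```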
international conference on high performance computing and simulation | 2017
Alberto Scionti; Somnath Mazumdar; Antoni Portero
Continuous demand for higher performance is putting more pressure on hardware designers to provide faster machines with low energy consumption. Recent technological advancements allow placing a group of silicon dies on top of a conventional interposer (silicon layer), which provides space to integrate logic and interconnection resources to manage active processing cores. However, such large resource availability requires an adequate Program eXecution Model (PXM) as well as an efficient mechanism to allocate resources in the system. From this perspective, fine-grain data-driven PXMs represent an attractive solution to reduce the cost of synchronising concurrent activities. The contribution of this work is twofold. First, a hardware architecture called TALHES (a Task ALlocator for HEterogeneous Systems) is proposed to support the scheduling of multi-threaded applications (adhering to an explicit data-driven PXM). TALHES introduces a Network-on-Chip (NoC) extension: i) on-chip 2D-mesh NoCs support locality of computation during the execution of a single task, while ii) a global task scheduler integrated into the silicon interposer orchestrates application tasks among different clusters of cores (possibly with different computing capabilities). The second contribution of the paper is a simulation framework tailored to support the analysis of such fine-grain data-driven applications. In this work, Linux Containers are used to abstract and efficiently simulate clusters of cores (i.e., a single die), as well as the behaviour of the global scheduling unit.
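An illustrative stand-in for the interposer-level global scheduler: ready data-driven tasks are dispatched to the least-loaded cluster of cores. The load-based policy below is a simple assumption for illustration, not TALHES's actual algorithm.

```c
#include <stdio.h>

#define NCLUSTERS 4

static int load[NCLUSTERS];       /* pending tasks per cluster */

/* Dispatch a ready task to the cluster with the fewest
 * pending tasks, mimicking a global scheduling unit. */
static int dispatch(int task_id) {
    int best = 0;
    for (int c = 1; c < NCLUSTERS; c++)
        if (load[c] < load[best])
            best = c;
    load[best]++;
    printf("task %d -> cluster %d (load %d)\n",
           task_id, best, load[best]);
    return best;
}

int main(void) {
    for (int t = 0; t < 8; t++)
        dispatch(t);              /* round-robins under equal load */
    return 0;
}
```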
annual simulation symposium | 2012
Antoni Portero; Alberto Scionti; Zhibin Yu; Paolo Faraboschi; Caroline Concatto; Luigi Carro; Arne Garbade; Sebastian Weis; Theo Ungerer; Roberto Giorgi
ieee international advance computing conference | 2017
Somnath Mazumdar; Alberto Scionti
architectural support for programming languages and operating systems | 2012
Roberto Giorgi; Alberto Scionti; Antoni Portero