Publication


Featured research published by Roberto Giorgi.


IEEE Transactions on Computers | 2001

Scheduled dataflow: execution paradigm, architecture, and performance evaluation

Krishna M. Kavi; Roberto Giorgi; Joseph Arul

In this paper, the scheduled dataflow (SDF) architecture, a decoupled memory/execution, multithreaded architecture using nonblocking threads, is presented in detail and evaluated against a superscalar architecture. Recent focus in the field of new processor architectures is mainly on VLIW (e.g., IA-64), superscalar, and superspeculative designs. This trend allows for better performance, but at the expense of increased hardware complexity and, possibly, higher power consumption resulting from dynamic instruction scheduling. Our research deviates from this trend by exploring a simpler, yet powerful, execution paradigm based on dataflow and multithreading. A program is partitioned into nonblocking execution threads, and all memory accesses are decoupled from a thread's execution: data is preloaded into the thread's context (registers), and all results are poststored after the thread completes. While multithreading and decoupling are possible with control-flow architectures, SDF makes it easier to coordinate the memory accesses and execution of a thread and to eliminate unnecessary dependencies among instructions. We have compared the execution cycles required by programs on SDF with those required on SimpleScalar (a superscalar simulator), considering the essential aspects of both architectures so as to obtain a fair comparison. The results show that the SDF architecture can outperform the superscalar one: SDF performance scales better with the number of functional units and allows for good exploitation of Thread Level Parallelism (TLP) and of the available chip area.
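
To make the decoupling concrete, the following C sketch (illustrative only, with hypothetical names; not the authors' code) mimics the three phases of an SDF thread: inputs are preloaded into the thread's register context, a synchronization count enables the thread once all inputs have arrived, the thread then executes without ever blocking on memory, and its results are poststored afterwards.

    /* Illustrative sketch (not the authors' code) of the SDF idea:
     * a nonblocking thread runs only after all inputs are preloaded,
     * and writes results back only after execution completes. */
    #include <stdio.h>

    #define NUM_INPUTS 2

    typedef struct {
        int regs[NUM_INPUTS]; /* thread context: preloaded inputs  */
        int result;           /* produced value, poststored later  */
        int sync_count;       /* inputs still missing; 0 = enabled */
    } sdf_thread;

    /* Synchronization pipeline: preload one input into the context. */
    static void preload(sdf_thread *t, int slot, int value) {
        t->regs[slot] = value;
        t->sync_count--;
    }

    /* Execution pipeline: runs to completion touching registers only,
     * so it can never block on memory. */
    static void execute(sdf_thread *t) {
        t->result = t->regs[0] + t->regs[1];
    }

    /* Poststore: results leave the context only after execution ends. */
    static void poststore(const sdf_thread *t, int *memory_cell) {
        *memory_cell = t->result;
    }

    int main(void) {
        int mem = 0;
        sdf_thread t = { .sync_count = NUM_INPUTS };

        preload(&t, 0, 3);
        preload(&t, 1, 4);
        if (t.sync_count == 0) {   /* thread enabled: all inputs ready */
            execute(&t);
            poststore(&t, &mem);
        }
        printf("%d\n", mem);       /* prints 7 */
        return 0;
    }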


Microprocessors and Microsystems | 2014

TERAFLUX: Harnessing dataflow in next generation teradevices

Roberto Giorgi; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Rahulkumar Gayatri; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Joshua Landwehr; Nhat Minh Lê; Feng Li; Mikel Luján; Avi Mendelson; Laurent Morin; Nacho Navarro; Tomasz Patejko; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Ian Watson; Sebastian Weis; Stéphane Zuckerman; Mateo Valero

The improvements in semiconductor technologies are gradually enabling extreme-scale systems such as teradevices (i.e., chips composed of roughly one trillion transistors), most likely by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses all three challenges at once by leveraging dataflow principles. This paper presents an overview of the research carried out by the TERAFLUX partners and some preliminary results. Our platform comprises 1000+ general-purpose cores per chip in order to properly explore the above challenges. An architectural template has been proposed, and applications have been ported to the platform. Programming models, compilation tools, and reliability techniques have been developed. The evaluation is carried out through modifications of the HP Labs COTSon simulator.


Symposium on Computer Architecture and High Performance Computing | 2007

DTA-C: A Decoupled multi-Threaded Architecture for CMP Systems

Roberto Giorgi; Zdravko Popovic; Nikola Puzovic

One way to exploit Thread Level Parallelism (TLP) is to use architectures that implement novel multithreaded execution models, such as Scheduled DataFlow (SDF). This model promises an elegant, decoupled, and non-blocking execution of threads. Here we extend it for use in future scalable CMP systems, where wire delay forces the design to be partitioned. We describe our approach, experiment with different distributed schedulers and with different numbers of clusters and processors per cluster, and present initial results on system scalability and performance. We also suggest design choices that improve the scalability of the basic design.


IEEE Transactions on Parallel and Distributed Systems | 1999

PSCR: a coherence protocol for eliminating passive sharing in shared-bus shared-memory multiprocessors

Roberto Giorgi; Cosimo Antonio Prete

In high-performance general-purpose workstations and servers, the workload is typically constituted of both sequential and parallel applications. Shared-bus shared-memory multiprocessors can be used to speed up the execution of such workloads. In this environment, the scheduler takes care of load balancing by allocating a ready process to the first available processor, thus producing process migration. Process migration and the persistence of private data in different caches produce an undesired form of sharing, named passive sharing. The copies due to passive sharing generate useless coherence traffic on the bus, and coping with this traffic can represent a challenging design problem for these machines. Many protocols use smart solutions to limit the overhead of maintaining coherence among shared copies, but none of these studies treats passive sharing directly, although some indirect effect is present when dealing with the other kinds of sharing. Affinity scheduling can alleviate the problem, but it does not adapt to all load conditions, especially when the effects of migration are massive. We present a simple coherence protocol that eliminates passive sharing using compiler-supplied information that is normally available in operating system kernels. We evaluate the performance of this protocol and compare it against other solutions proposed in the literature by means of enhanced trace-driven simulation. We also evaluate its complexity in terms of the number of protocol states, additional bus lines, and required software support. Our protocol further limits the coherence-maintenance overhead by using information about the access patterns to shared data exhibited by parallel applications.
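
The following C fragment is a minimal sketch of the passive-sharing idea, assuming a toy snoopy-cache model with hypothetical names (it is not the actual PSCR state machine): cache lines tagged as private are simply dropped when another processor touches them, so copies left behind by a migrated process never generate coherence traffic.

    /* Minimal sketch of the passive-sharing idea (hypothetical model,
     * not the actual PSCR state machine). Lines tagged as private by
     * the OS are self-invalidated when another processor touches them,
     * so stale private copies never generate coherence traffic. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        unsigned tag;
        bool valid;
        bool is_private; /* supplied at miss time from the OS page table */
    } cache_line;

    /* Snoop handler: called on every bus transaction seen by this cache. */
    static void snoop(cache_line *line, unsigned bus_tag, int requester,
                      int my_cpu) {
        if (!line->valid || line->tag != bus_tag || requester == my_cpu)
            return;
        if (line->is_private) {
            /* The owning process migrated: drop the now-passive copy
             * instead of keeping it coherent with extra bus traffic. */
            line->valid = false;
            printf("cpu %d: invalidated passive copy of tag %u\n",
                   my_cpu, bus_tag);
        }
        /* Truly shared lines would follow the normal protocol here. */
    }

    int main(void) {
        cache_line l = { .tag = 42, .valid = true, .is_private = true };
        snoop(&l, 42, /*requester=*/1, /*my_cpu=*/0);
        return 0;
    }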


Archive | 2015

Guide to DataFlow Supercomputing: Basic Concepts, Case Studies, and a Detailed Example

Veljko Milutinovic; Jakob Salom; Nemanja Trifunovic; Roberto Giorgi

This unique text/reference describes an exciting and novel approach to supercomputing in the DataFlow paradigm. The major advantages and applications of this approach are clearly described, and a detailed explanation of the programming model is provided using simple yet effective examples. The work is developed from a series of lecture courses taught by the authors in more than 40 universities across more than 20 countries, and from research carried out by Maxeler Technologies, Inc. Topics and features: presents a thorough introduction to DataFlow supercomputing for big data problems; reviews the latest research on the DataFlow architecture and its applications; introduces a new method for the rapid handling of real-world challenges involving large datasets; provides a case study on the use of the new approach to accelerate the Cooley-Tukey algorithm on a DataFlow machine; includes a step-by-step guide to the web-based integrated development environment WebIDE.


Future Generation Computer Systems | 2015

A scalable thread scheduling co-processor based on data-flow principles

Roberto Giorgi; Alberto Scionti

Large synchronization and communication overheads will become a major concern in future extreme-scale machines (e.g., HPC systems, supercomputers). These systems will push performance limits upwards by adopting chips equipped with one order of magnitude more cores than today's. Alternative execution models can be explored in order to exploit the high parallelism offered by future massive many-core chips. This paper proposes the integration of standard cores with dedicated co-processing units that enable the system to support the fine-grain data-flow execution model developed within the TERAFLUX project. An instruction set architecture extension for supporting fine-grain thread scheduling and execution is proposed. This extension is supported by the co-processor, which provides hardware units for accelerating thread scheduling and distribution among the available cores. Two fundamental aspects are at the base of the proposed system: programmers can adopt their preferred programming model, and the compilation tools can produce a large set of threads communicating mainly in a producer-consumer fashion, hence enabling data-flow execution. Experimental results demonstrate the feasibility of the proposed approach and its capability to scale with an increasing number of cores. Highlights: We present a data-flow based co-processor supporting the execution of fine-grain threads. We propose a minimalistic core ISA extension for data-flow threads. We propose a two-level hierarchical scheduling co-processor that implements the ISA extension. We show the scalability of the proposed system through a set of experimental results.
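
As a rough illustration of how software might drive such an ISA extension, the C sketch below models the extension as plain functions; the names df_schedule, df_write, and df_destroy are hypothetical stand-ins, not the project's actual mnemonics. A consumer thread is created with a count of expected inputs, producers write into its frame, and the last write makes it ready, which is the point where the hardware co-processor would dispatch it to a core.

    /* Sketch of a producer-consumer pair driving a data-flow thread
     * ISA extension; the intrinsics are hypothetical stand-ins,
     * modeled here as plain C functions. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct df_thread df_thread;
    struct df_thread {
        void (*body)(df_thread *);
        int  *frame;      /* inputs written by producers            */
        int   sync_count; /* pending inputs; 0 => ready to dispatch */
    };

    /* df_schedule: create a thread that waits for `inputs` values. */
    static df_thread *df_schedule(void (*body)(df_thread *), int inputs) {
        df_thread *t = malloc(sizeof *t);
        t->body = body;
        t->frame = calloc(inputs, sizeof *t->frame);
        t->sync_count = inputs;
        return t;
    }

    /* df_write: a producer stores one input; the last write fires the
     * thread (in hardware, the co-processor would dispatch it). */
    static void df_write(df_thread *t, int slot, int value) {
        t->frame[slot] = value;
        if (--t->sync_count == 0)
            t->body(t);
    }

    static void df_destroy(df_thread *t) { free(t->frame); free(t); }

    static void consumer(df_thread *self) {
        printf("sum = %d\n", self->frame[0] + self->frame[1]);
    }

    int main(void) {
        df_thread *c = df_schedule(consumer, 2);
        df_write(c, 0, 10);  /* producer 1 */
        df_write(c, 1, 32);  /* producer 2: enables the consumer */
        df_destroy(c);
        return 0;
    }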


Digital Systems Design | 2013

The TERAFLUX Project: Exploiting the DataFlow Paradigm in Next Generation Teradevices

Marco Solinas; Rosa M. Badia; François Bodin; Albert Cohen; Paraskevas Evripidou; Paolo Faraboschi; Bernhard Fechner; Guang R. Gao; Arne Garbade; Sylvain Girbal; Daniel Goodman; Behram Khan; Souad Koliai; Feng Li; Mikel Luján; Laurent Morin; Avi Mendelson; Nacho Navarro; Antoniu Pop; Pedro Trancoso; Theo Ungerer; Mateo Valero; Sebastian Weis; Ian Watson; Stéphane Zuckerman; Roberto Giorgi

Thanks to improvements in semiconductor technologies, extreme-scale systems such as teradevices (i.e., chips composed of roughly one trillion transistors) will enable systems with 1000+ general-purpose cores per chip, probably by 2020. Three major challenges have been identified: programmability, manageable architecture design, and reliability. TERAFLUX is a Future and Emerging Technology (FET) large-scale project funded by the European Union, which addresses all three challenges at once by leveraging dataflow principles. This paper describes the project and provides an overview of the research carried out by the TERAFLUX consortium.


Memory Performance: Dealing with Applications, Systems and Architecture | 2004

A workload characterization of elliptic curve cryptography methods in embedded environments

Irina Branovic; Roberto Giorgi; Enrico Martinelli

Elliptic Curve Cryptography (ECC) is emerging as an attractive public-key system for constrained environments because of its small key sizes and computational efficiency, while preserving the same security level as the standard methods. We have developed a set of benchmarks to compare standard public-key methods with their elliptic curve counterparts. An embedded device based on the Intel XScale architecture, which utilizes an ARM processor core, was modeled and used for studying the benchmark performance, and different possible variations of the memory hierarchy of this basic architecture were considered. We compared our benchmarks with MiBench/Security, another widely accepted benchmark set, to provide a reference for our evaluation. We studied the operations and the memory impact of the Diffie-Hellman key exchange, the digital signature algorithm, ElGamal, and the RSA public-key cryptosystem. Elliptic curve cryptosystems are more efficient in terms of execution time, but their impact on the memory subsystem has to be taken into account when designing embedded devices in order to achieve better performance.
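
For a sense of the arithmetic these benchmarks exercise, the following toy C example (made-up curve parameters, not a production implementation) performs one elliptic-curve point operation: a handful of modular multiplications plus one modular inversion, which is why ECC stays efficient at small key sizes.

    /* Toy illustration of the elliptic-curve group operation whose
     * cost such benchmarks characterize. Curve and point values are
     * made up for demonstration; real ECC uses much larger primes. */
    #include <stdint.h>
    #include <stdio.h>

    static const int64_t P = 97, A = 2;  /* y^2 = x^3 + Ax + B mod P */

    static int64_t mod(int64_t v) { v %= P; return v < 0 ? v + P : v; }

    /* Modular inverse via Fermat's little theorem (P is prime). */
    static int64_t inv(int64_t v) {
        int64_t r = 1, e = P - 2;
        v = mod(v);
        while (e) {
            if (e & 1) r = mod(r * v);
            v = mod(v * v);
            e >>= 1;
        }
        return r;
    }

    /* Add points (x1,y1)+(x2,y2); doubling uses the tangent slope. */
    static void ec_add(int64_t x1, int64_t y1, int64_t x2, int64_t y2,
                       int64_t *x3, int64_t *y3) {
        int64_t lam = (x1 == x2 && y1 == y2)
            ? mod(mod(3 * x1 * x1 + A) * inv(2 * y1))  /* tangent */
            : mod(mod(y2 - y1) * inv(x2 - x1));        /* chord   */
        *x3 = mod(lam * lam - x1 - x2);
        *y3 = mod(lam * (x1 - *x3) - y1);
    }

    int main(void) {
        int64_t x, y;
        ec_add(3, 6, 3, 6, &x, &y);     /* double the point (3,6) */
        printf("2P = (%lld, %lld)\n", (long long)x, (long long)y);
        return 0;
    }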


Journal of Universal Computer Science | 2000

Execution and Cache Performance of the Scheduled Dataflow Architecture

Krishna M. Kavi; Joseph Arul; Roberto Giorgi

This paper presents an evaluation of our Scheduled Dataflow (SDF) processor. Recent focus in the field of new processor architectures is mainly on VLIW (e.g., IA-64), superscalar, and superspeculative architectures. This trend allows for better performance at the expense of increased hardware complexity and a brute-force approach to the memory-wall problem. Our research substantially deviates from this trend by exploring a simpler, yet powerful, execution paradigm based on dataflow concepts. A program is partitioned into functional execution threads, which are perfectly suited to our non-blocking multithreaded architecture. In addition, all memory accesses are decoupled from a thread's execution: data is pre-loaded into the thread's context (registers), and all results are post-stored after the thread completes. This decoupling requires a separate unit to perform the necessary pre-loads and post-stores and to control the allocation of hardware thread contexts to enabled threads. An analytical evaluation of our architecture showed that we could achieve better performance than classical dataflow architectures (e.g., ETS), hybrid models (e.g., EARTH), and decoupled multithreaded architectures (e.g., the Rhamma processor). This paper analyzes the architecture using an instruction-set-level simulator on a variety of benchmark programs. We compared the execution cycles required by programs on SDF with those required on DLX (or MIPS). We then investigated the expected cache-memory performance by collecting address traces from the programs and feeding them to a trace-driven cache simulator (Dinero-IV). We present these results in this paper.
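
The trace-driven methodology mentioned above is easy to sketch: feed a recorded address trace through a cache model and count hits and misses. The toy direct-mapped cache below (hypothetical parameters) stands in for a full simulator such as Dinero-IV.

    /* Sketch of trace-driven cache simulation: a recorded address
     * trace is replayed through a toy direct-mapped cache model. */
    #include <stdio.h>

    #define LINES      64      /* number of cache lines */
    #define BLOCK_BITS 5       /* 32-byte blocks        */

    static unsigned tags[LINES];
    static int      valid[LINES];

    static int access_cache(unsigned addr) {
        unsigned block = addr >> BLOCK_BITS;
        unsigned idx   = block % LINES;
        unsigned tag   = block / LINES;
        if (valid[idx] && tags[idx] == tag)
            return 1;                  /* hit                 */
        valid[idx] = 1;                /* miss: fill the line */
        tags[idx]  = tag;
        return 0;
    }

    int main(void) {
        /* A real trace would be read from a file produced by the
         * traced program. */
        unsigned trace[] = { 0x1000, 0x1004, 0x2000, 0x1008, 0x9000 };
        int hits = 0, n = sizeof trace / sizeof trace[0];
        for (int i = 0; i < n; i++)
            hits += access_cache(trace[i]);
        printf("hits: %d / %d\n", hits, n);
        return 0;
    }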


IEEE Concurrency | 1997

Trace Factory: generating workloads for trace-driven simulation of shared-bus multiprocessors

Roberto Giorgi; Cosimo Antonio Prete; Gianpaolo Prina; Luigi M. Ricciardi

A major concern with high-performance general-purpose workstations is to speed up the execution of commands, uniprocess applications, and multiprocess applications with coarse- to medium-grain parallelism. The authors have developed a methodology and a set of tools to generate traces for the performance evaluation of shared-bus, shared-memory multiprocessor systems. Trace Factory produces traces representing significant real workloads consisting of a flexible set of commands and uniprocess and multiprocess user applications. The authors evaluate its accuracy and show how it can be used to evaluate and compare the performance of five coherence protocols.
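
As a minimal sketch of the workload-composition idea (with a hypothetical record format, not Trace Factory's actual one), the C program below interleaves the reference streams of several uniprocess applications under a simple scheduler that assigns each ready process to the first free processor, producing a merged multiprocessor trace; note how the rotating assignment also models process migration.

    /* Sketch of composing a multiprocessor trace from per-process
     * reference streams (hypothetical record format). */
    #include <stdio.h>

    #define NCPU   2   /* processors                           */
    #define NPROCS 3   /* processes                            */
    #define REFS   3   /* references per process in this input */

    int main(void) {
        /* Stand-ins for source traces of three uniprocess programs. */
        unsigned src[NPROCS][REFS] = {
            { 0x100, 0x104, 0x108 },
            { 0x200, 0x204, 0x208 },
            { 0x300, 0x304, 0x308 },
        };
        int next[NPROCS] = { 0 };

        /* Each quantum: assign processes to processors in rotation
         * (modeling migration), then emit one reference per running
         * process as a (cpu, process, address) record. */
        for (int quantum = 0; ; quantum++) {
            int emitted = 0;
            for (int cpu = 0; cpu < NCPU; cpu++) {
                int p = (quantum * NCPU + cpu) % NPROCS;
                if (next[p] < REFS) {
                    printf("cpu=%d proc=%d addr=0x%x\n",
                           cpu, p, src[p][next[p]++]);
                    emitted = 1;
                }
            }
            if (!emitted) break;   /* all source traces consumed */
        }
        return 0;
    }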
