Cosimo Antonio Prete | Researchain

Publication

Featured research published by Cosimo Antonio Prete.

IEEE Transactions on Parallel and Distributed Systems | 1995

A trace-driven simulator for performance evaluation of cache-based multiprocessor systems

Cosimo Antonio Prete; Gianpaolo Prina; Luigi M. Ricciardi

We describe a simulator which emulates the activity of a shared-memory, common-bus multiprocessor system with private caches. Both kernel and user program activities are considered, thus allowing an accurate analysis and evaluation of coherence protocol performance. The simulator can generate synthetic traces based on a wide set of input parameters which specify processor, kernel and workload features. Other parameters allow us to detail the multiprocessor architecture for which the analysis has to be carried out. A simulation driven by actual traces is also possible, in order to evaluate the performance of a specific multiprocessor with respect to a given workload, if traces for this workload are available. In a separate section, we describe how actual traces can also be used to extract a set of input parameters for synthetic trace generation. Finally, we show how the simulator may be successfully employed to carry out a detailed performance analysis of a specific coherence protocol.

memory performance dealing with applications systems and architecture | 2007

Analysis of static and dynamic energy consumption in NUCA caches: initial results

Alessandro Bardine; Pierfrancesco Foglia; Giacomo Gabrielli; Cosimo Antonio Prete

NUCA caches are large L2 on-chip cache memories characterized by multi-bank partitioning and designed to hide wire delay effects. They exhibit high hit rates while keeping access latency low. Proposed designs for such caches are Static NUCA, in which data are statically allocated to the cache banks, and Dynamic NUCA, in which data may reside in different banks and a migration mechanism is introduced to better tolerate wire delay effects. The two architectures achieve different performance levels by acting on architectural parameters and data management policies, at the cost of different balances between static and dynamic power consumption and energy dissipation. In this work, we present preliminary results on the characterization of such balances, by evaluating the performance and energy consumption of conventional UCA caches and of Static and Dynamic NUCA caches. All the considered cache architectures are equally sized and are assumed to be used in an aggressive high-frequency system running some applications from the SPEC CPU2000 and NAS Parallel Benchmarks suites. The experimental results indicate that, although the migration of data increases the dynamic energy consumption in Dynamic NUCA caches, the higher IPC achieved saves static energy, which dominates the power/energy balance in all the considered architectures. As a consequence, these results would designate NUCA caches as the best-performing and most energy-efficient architectures. In addition, according to the obtained results, future power improvements for NUCA caches should concentrate on static energy, while, for dynamic energy, the on-chip network is the most critical element. Migration of data is acceptable, since it has a positive impact on performance, and the increased dynamic energy is outweighed by the static energy savings resulting from the shorter execution time.
To give these statements general validity, we need to explore more design space points for each architecture (by varying the clock rate and other design parameters) and to evaluate them on a larger set of benchmarks.

Computer-aided Design | 2012

A real-time configurable NURBS interpolator with bounded acceleration, jerk and chord error

Massimiliano Annoni; Alessandro Bardine; Stefano Campanelli; Pierfrancesco Foglia; Cosimo Antonio Prete

Advances in manufacturing technologies and in machine tools allow for unprecedented quality and efficiency in production lines, but also impose new and increasingly demanding requirements on motion planning and control systems. The increase in CPU processing power has permitted the introduction of NURBS interpolation capabilities in traditional CNC systems, leading to a further increase in machining quality and efficiency. This has posed new and still unsolved issues, such as the need to satisfy multiple competing constraints, like limiting chord error, acceleration and jerk, while offering real-time guarantees. In addition, the ability to prioritize production throughput by relaxing one or more of these constraints in a simple way has emerged as another requirement of modern manufacturing plants. Nevertheless, none of the existing NURBS interpolators has these characteristics. In this work, we propose a NURBS interpolator that satisfies all the manufacturing technology requirements and, thanks to its bounded computational complexity, respects the real-time constraints of position control. Such an interpolator is easily reconfigurable, i.e., it can relax some of the constraints while maintaining better performance than previously proposed solutions, and it can be adapted to include constraints that were not originally considered. The performance of the proposed algorithm has been evaluated both by simulations and by real milling experiments.
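As an illustration of the chord-error constraint mentioned in the abstract, the sketch below uses the common circular-arc approximation: if the curve locally has radius of curvature `radius` and the interpolator emits one point every `period` seconds, the straight chord between two consecutive points deviates from the arc by at most a computable amount. This is a textbook approximation and not the algorithm proposed in the paper; the function names, parameters and numeric values are assumptions chosen for the example.

```python
import math

def chord_error(radius, feedrate, period):
    """Deviation between a circular arc and the chord travelled in one sample.

    Assumes the path is locally a circle of the given radius and the tool
    moves feedrate * period along the chord during one interpolation period.
    """
    half_chord = feedrate * period / 2.0
    if half_chord > radius:
        raise ValueError("chord longer than the circle diameter")
    return radius - math.sqrt(radius ** 2 - half_chord ** 2)

def max_feedrate_for_chord_error(radius, chord_tol, period):
    """Largest feedrate whose per-sample chord error stays within chord_tol."""
    if chord_tol >= radius:
        # The tolerance is so loose that even a diameter-long chord is allowed.
        return 2.0 * radius / period
    step = 2.0 * math.sqrt(2.0 * radius * chord_tol - chord_tol ** 2)
    return step / period

# Hypothetical values: 10 mm radius, 1 micrometre tolerance, 1 ms period.
v_max = max_feedrate_for_chord_error(radius=10.0, chord_tol=0.001, period=0.001)
print(f"feedrate limit: {v_max:.1f} mm/s")
print(f"resulting chord error: {chord_error(10.0, v_max, 0.001):.6f} mm")
```

A real interpolator would combine a limit of this kind with acceleration and jerk limits and take the most restrictive one at each sample; the sketch covers the chord-error term only.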

IEEE Transactions on Parallel and Distributed Systems | 1999

PSCR: a coherence protocol for eliminating passive sharing in shared-bus shared-memory multiprocessors

Roberto Giorgi; Cosimo Antonio Prete

In high-performance general-purpose workstations and servers, the workload typically consists of both sequential and parallel applications. Shared-bus shared-memory multiprocessors can be used to speed up the execution of such a workload. In this environment, the scheduler takes care of load balancing by allocating a ready process to the first available processor, thus producing process migration. Process migration and the persistence of private data in different caches produce an undesired form of sharing, named passive sharing. The copies due to passive sharing generate useless coherence traffic on the bus, and coping with this problem may be a challenging design issue for these machines. Many protocols use smart solutions to limit the overhead of maintaining coherence among shared copies. None of these studies treats passive sharing directly, although some indirect effect is present when dealing with the other kinds of sharing. Affinity scheduling can alleviate this problem, but this technique does not adapt to all load conditions, especially when the effects of migration are massive. We present a simple coherence protocol that eliminates passive sharing using information from the compiler that is normally available in operating system kernels. We evaluate the performance of this protocol and compare it against other solutions proposed in the literature by means of enhanced trace-driven simulation. We evaluate the complexity in terms of the number of protocol states, additional bus lines, and required software support. Our protocol further limits the coherence-maintaining overhead by using information about access patterns to shared data exhibited in parallel applications.

IEEE Parallel & Distributed Technology: Systems & Applications | 1995

Graphical design of distributed applications through reusable components

Alberto Bartoli; Paolo Corsini; Gianluca Dini; Cosimo Antonio Prete

The Tracs graphical programming environment promotes a modular approach to the development of distributed applications. A few types of reusable design components make the environment both simple and powerful. Tracs exploits modularity in an original way. Its support of message models, task models, and architecture models as basic design components provides programmers with a framework that has proven practical, powerful, and easy to understand. Furthermore, modularity has allowed us to add advanced facilities to the environment, with little implementation and integration effort. From this point of view, our choice of supporting message models as a basic design component has proven appropriate. Several of the ideas explored in Tracs will be useful in future work on programming environments for parallel and distributed systems.

embedded systems for real-time multimedia | 2005

A NUCA model for embedded systems cache design

Pierfrancesco Foglia; Daniele Mangano; Cosimo Antonio Prete

Embedded applications require high-performance processors integrating fast and low-power caches. Dynamic non-uniform cache architectures (D-NUCA) have been proposed to overcome the performance limit introduced by wire delays when designing large caches. In this paper, we propose an alternative D-NUCA cache design, namely the triangular D-NUCA cache, to reduce the power consumption and silicon area occupancy of D-NUCA caches. We compare the performance of the triangular D-NUCA cache with that of the conventional rectangular organization. Results show that our approach is particularly useful in the embedded applications domain, as it permits the use of a half-sized NUCA cache with performance improvements.

International Journal of High Performance Systems Architecture | 2010

Way adaptable D-NUCA caches

Alessandro Bardine; Manuel Comparetti; Pierfrancesco Foglia; Giacomo Gabrielli; Cosimo Antonio Prete

Non-uniform cache architecture (NUCA) aims to limit the wire-delay problem typical of large on-chip last-level caches: by partitioning a large cache into several banks, with the latency of each one depending on its physical location, and by employing a scalable on-chip network to interconnect the banks with the cache controller, the average access latency can be reduced with respect to a traditional cache. The addition of a migration mechanism to move the most frequently accessed data towards the cache controller (D-NUCA) further improves the average access latency. In this work we propose a last-level cache design, based on the D-NUCA scheme, which is able to significantly limit its static power consumption by dynamically adapting to the needs of the running application: the way adaptable D-NUCA cache. This design leads to a fast and power-efficient memory hierarchy with an average reduction of 31.2% in energy-delay product (EDP) with respect to a traditional D-NUCA. We propose and discuss a methodology for tuning the intrinsic parameters of our design and investigate the adoption of the way adaptable D-NUCA scheme as a shared L2 cache in a chip multiprocessor (CMP) system (24% reduction of EDP).

international symposium on microarchitecture | 1997

The ChARM tool for tuning embedded systems

Cosimo Antonio Prete; Marco Graziano; Francesco Lazzarini

ChARM is a simulation tool for tuning ARM-based embedded systems that include cache memories. ChARM provides a parametric, trace-driven simulation for tuning system configuration. A designer can observe performance while varying the timing, the architectural features, and the management policies of the system components. Designers can therefore evaluate the execution time of the program, the time spent in memory accesses, the miss ratio, the code miss ratio, the data miss ratio, and the number of burst-read operations. They can also evaluate the number of write operations for write-through cache models and burst-write operations for copy-back cache models. Finally, ChARM's program locality analysis illustrates the sequentiality, temporality, and loops of a program in easy-to-read three-dimensional graphs. These graphs, together with the graphs showing the distribution of the replacement conflicts in cache, help designers understand how a program works and how it stresses the memory hierarchy.
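To make the idea of trace-driven cache simulation concrete, here is a minimal sketch of a unified direct-mapped cache that replays an address trace and reports overall, code and data miss ratios. It is not ChARM and does not reproduce its models or interface; the function name, the trace format (pairs of access kind and byte address) and the default cache geometry are all assumptions made for this example.

```python
def simulate_direct_mapped(trace, num_lines=256, line_size=16):
    """Replay a trace of (kind, address) pairs through a direct-mapped cache.

    kind is "code" or "data"; address is a byte address. Both kinds share
    one cache (a unified cache), which is an assumption of this sketch.
    """
    tags = [None] * num_lines            # tag stored in each cache line
    accesses = {"code": 0, "data": 0}
    misses = {"code": 0, "data": 0}

    for kind, address in trace:
        block = address // line_size     # memory block number
        index = block % num_lines        # cache line the block maps to
        tag = block // num_lines         # identifies the block within that line
        accesses[kind] += 1
        if tags[index] != tag:           # miss: fetch the block, evict the old one
            misses[kind] += 1
            tags[index] = tag

    def ratio(miss_count, access_count):
        return miss_count / access_count if access_count else 0.0

    return {
        "miss_ratio": ratio(sum(misses.values()), sum(accesses.values())),
        "code_miss_ratio": ratio(misses["code"], accesses["code"]),
        "data_miss_ratio": ratio(misses["data"], accesses["data"]),
    }

# Tiny hand-made trace: addresses 0 and 4096 conflict on the same cache line.
example_trace = [("code", 0), ("code", 4), ("data", 4096), ("data", 0), ("code", 0)]
print(simulate_direct_mapped(example_trace))
```

Varying `num_lines` and `line_size` and replaying the same trace is the basic tuning loop that a parametric simulator automates; timing, write policies and burst operations are left out of this sketch.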

IEEE Concurrency | 1997

Trace Factory: generating workloads for trace-driven simulation of shared-bus multiprocessors

Roberto Giorgi; Cosimo Antonio Prete; Gianpaolo Prina; Luigi M. Ricciardi

A major concern with high-performance general-purpose workstations is to speed up the execution of commands, uniprocess applications, and multiprocess applications with coarse- to medium-grain parallelism. The authors have developed a methodology and a set of tools to generate traces for the performance evaluation of shared-bus, shared-memory multiprocessor systems. Trace Factory produces traces representing significant real workloads consisting of a flexible set of commands and uniprocess and multiprocess user applications. The authors evaluate its accuracy and show how it can be used to evaluate and compare the performance of five coherence protocols.

digital systems design | 2008

Leveraging Data Promotion for Low Power D-NUCA Caches

Alessandro Bardine; Manuel Comparetti; Pierfrancesco Foglia; Giacomo Gabrielli; Cosimo Antonio Prete; Per Stenström

D-NUCA caches are cache memories that, thanks to a banked organization, broadcast search and a promotion/demotion mechanism, are able to tolerate the increasing wire delay effects introduced by technology scaling. As a consequence, they will outperform conventional caches (UCA, Uniform Cache Architectures) in future-generation cores. Due to the promotion/demotion mechanism, we observed that the distribution of hits across the ways of a D-NUCA cache varies across applications as well as across different execution phases within a single application. In this work, we show how such behavior can be leveraged to improve D-NUCA power efficiency as well as to decrease its access latency. In particular, we propose: 1) A new micro-architectural technique to reduce the static power consumption of a D-NUCA cache by dynamically adapting the number of active (i.e. powered-on) ways to the needs of the running application; our evaluation shows that a strong reduction of the average number of active ways (37.1%) is achievable without significantly affecting the IPC (-2.83%), leading to a reduction of the Energy Delay Product (EDP) of 30.9%. 2) A strategy to estimate the characteristic parameters of the proposed technique. 3) An evaluation of the effectiveness of the proposed technique in the multicore environment.
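The relation between the reported figures can be explored with a short calculation. The energy-delay product is energy multiplied by execution time, so an EDP change and an IPC change together imply an energy change. The sketch below assumes the instruction count and clock frequency are unchanged, so that execution time scales as 1/IPC; that assumption, the function name and the derived energy figure are ours and are not stated in the abstract.

```python
def energy_ratio_from_edp(edp_reduction, ipc_change):
    """Energy ratio (new / baseline) implied by an EDP reduction and an IPC change.

    edp_reduction: fractional EDP reduction, e.g. 0.309 for 30.9%.
    ipc_change:    fractional IPC change, e.g. -0.0283 for -2.83%.
    Assumes the same instruction count and clock, so delay scales as 1 / IPC.
    """
    delay_ratio = 1.0 / (1.0 + ipc_change)   # lower IPC means longer execution
    edp_ratio = 1.0 - edp_reduction
    return edp_ratio / delay_ratio           # EDP = energy * delay

ratio = energy_ratio_from_edp(edp_reduction=0.309, ipc_change=-0.0283)
print(f"implied energy ratio: {ratio:.3f}")
```

Under these assumptions the implied energy ratio is about 0.67, meaning the slightly longer execution time gives back only a small part of the energy saving.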
