Pierfrancesco Foglia | Researchain

Publication

Featured research published by Pierfrancesco Foglia.

memory performance dealing with applications systems and architecture | 2007

Analysis of static and dynamic energy consumption in NUCA caches: initial results

Alessandro Bardine; Pierfrancesco Foglia; Giacomo Gabrielli; Cosimo Antonio Prete

NUCA caches are large on-chip L2 cache memories characterized by multi-bank partitioning and designed to hide wire-delay effects. They exhibit high hit rates while keeping access latency low. Proposed designs for such caches are Static NUCA, in which data are statically allocated to the cache banks, and Dynamic NUCA, in which data may reside in different banks and a migration mechanism is introduced to better tolerate wire-delay effects. The two architectures achieve different performance levels by acting on architectural parameters and data management policies, at the cost of different balances between static and dynamic power consumption and energy dissipation. In this work, we present preliminary results on the characterization of such balances, evaluating the performance and energy consumption of conventional UCA caches and of Static and Dynamic NUCA caches. All the considered cache architectures are of equal size and are assumed to be used in an aggressive high-frequency system running applications from the SPEC CPU2000 and NAS Parallel Benchmarks suites. The experimental results indicate that, although data migration increases the dynamic energy consumption of Dynamic NUCA caches, the higher IPC achieved saves static energy, which dominates the power/energy balance in all the considered architectures. As a consequence, these results point to NUCA caches as the best-performing and most energy-efficient architectures. Moreover, according to the results, future power improvements for NUCA caches should concentrate on static energy, while for dynamic energy the on-chip network is the most critical element. Data migration is acceptable, since it has a positive impact on performance and the increased dynamic energy is outweighed by the static energy savings resulting from the shorter execution time.
To give general validity to these statements, we need to explore more design-space points for each architecture (by varying the clock rate and other design parameters) and to evaluate them on a larger set of benchmarks.
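
The static/dynamic trade-off described in this abstract can be illustrated with a minimal sketch. It assumes a very simple model in which static energy is leakage power multiplied by execution time and dynamic energy is a fixed cost per bank access. The function name and every numeric value below are hypothetical and are not taken from the paper.

```python
def total_energy(static_power_w, exec_time_s, accesses, dyn_energy_per_access_j):
    """Return (static, dynamic, total) energy in joules.

    Static energy grows with execution time, because leakage is paid for
    as long as the cache is powered. Dynamic energy grows with activity.
    """
    static = static_power_w * exec_time_s
    dynamic = accesses * dyn_energy_per_access_j
    return static, dynamic, static + dynamic


# Hypothetical illustration values, NOT measurements from the paper.
baseline = total_energy(2.0, 1.00, 1_000_000, 1e-7)

# With migration: more bank accesses (higher dynamic energy), but a
# shorter run time thanks to higher IPC (lower static energy).
migrating = total_energy(2.0, 0.90, 1_500_000, 1e-7)

print("baseline  (static, dynamic, total):", baseline)
print("migrating (static, dynamic, total):", migrating)
```

Under these assumed numbers the extra dynamic energy is smaller than the static energy saved, which is the qualitative behavior the abstract reports. With a different leakage-to-access-cost ratio the balance could go the other way.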

Computer-aided Design | 2012

A real-time configurable NURBS interpolator with bounded acceleration, jerk and chord error

Massimiliano Annoni; Alessandro Bardine; Stefano Campanelli; Pierfrancesco Foglia; Cosimo Antonio Prete

Advances in manufacturing technologies and in machine tools allow for unprecedented quality and efficiency in production lines, but also impose new and increasing requirements on motion planning and control systems. The increase in CPU processing power has permitted the introduction of NURBS interpolation capabilities in traditional CNC systems, bringing a further increase in machining quality and efficiency. This has raised new and still unsolved issues, such as the need to satisfy multiple conflicting constraints (limiting chord error, acceleration and jerk) while offering real-time guarantees. In addition, the ability to prioritize production throughput by relaxing one or more of these constraints in a simple way has emerged as another requirement of modern manufacturing plants. Nevertheless, none of the existing NURBS interpolators has these characteristics. In this work, we propose a NURBS interpolator that satisfies all the manufacturing technology requirements and, thanks to its bounded computational complexity, respects the real-time constraints of position control. The interpolator is easily reconfigurable, i.e., it can relax some of the constraints while maintaining better performance than previously proposed solutions, and it can be adapted to include constraints that were not originally considered. The performance of the proposed algorithm has been evaluated both by simulations and by real milling experiments.
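
To make the chord error, acceleration and jerk constraints concrete, the following minimal sketch computes a feedrate limit at one point of a curve. It assumes the usual local circular-arc approximation (radius of curvature `rho`), and the jerk bound is a common approximation rather than an exact result. This is not the interpolator proposed in the paper, and all function names, parameters and values are hypothetical.

```python
import math


def chord_limited_feedrate(rho, period, chord_tol):
    """Largest feedrate whose per-period chord stays within chord_tol.

    For an arc of radius rho and chord length L, the chord error is
    rho - sqrt(rho**2 - (L / 2)**2). Solving for L gives
    L = 2 * sqrt(2 * rho * d - d**2), with d capped at rho.
    """
    d = min(chord_tol, rho)
    return (2.0 / period) * math.sqrt(2.0 * rho * d - d * d)


def accel_limited_feedrate(rho, a_max):
    """Centripetal acceleration v**2 / rho must not exceed a_max."""
    return math.sqrt(a_max * rho)


def jerk_limited_feedrate(rho, j_max):
    """Common approximation: jerk on an arc scales as v**3 / rho**2."""
    return (j_max * rho * rho) ** (1.0 / 3.0)


def feedrate_limit(rho, period, chord_tol, a_max, j_max, v_cmd):
    """Commanded feedrate clipped by all three constraints."""
    return min(
        v_cmd,
        chord_limited_feedrate(rho, period, chord_tol),
        accel_limited_feedrate(rho, a_max),
        jerk_limited_feedrate(rho, j_max),
    )


# Hypothetical values: mm, s, mm/s, mm/s^2, mm/s^3.
v = feedrate_limit(rho=10.0, period=0.001, chord_tol=0.001,
                   a_max=1000.0, j_max=50000.0, v_cmd=200.0)
print(v)
```

Relaxing a constraint, as the abstract describes, corresponds in this sketch to dropping one of the arguments of `min`. A real interpolator must also plan feedrate transitions along the path, which this pointwise bound does not cover.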

embedded systems for real-time multimedia | 2005

A NUCA model for embedded systems cache design

Pierfrancesco Foglia; Daniele Mangano; Cosimo Antonio Prete

Embedded applications require high-performance processors that integrate fast, low-power caches. Dynamic non-uniform cache architectures (D-NUCA) have been proposed to overcome the performance limit introduced by wire delays in large caches. In this paper, we propose an alternative D-NUCA design, the triangular D-NUCA cache, to reduce the power consumption and silicon area of D-NUCA caches. We compare the performance of the triangular D-NUCA cache with that of the conventional rectangular organization. Results show that our approach is particularly useful in the embedded applications domain, as it permits the use of a half-sized NUCA cache with performance improvements.

International Journal of High Performance Systems Architecture | 2010

Way adaptable D-NUCA caches

Alessandro Bardine; Manuel Comparetti; Pierfrancesco Foglia; Giacomo Gabrielli; Cosimo Antonio Prete

Non-uniform cache architecture (NUCA) aims to limit the wire-delay problem typical of large on-chip last-level caches: by partitioning a large cache into several banks, with the latency of each one depending on its physical location, and by employing a scalable on-chip network to interconnect the banks with the cache controller, the average access latency can be reduced with respect to a traditional cache. The addition of a migration mechanism to move the most frequently accessed data towards the cache controller (D-NUCA) further improves the average access latency. In this work we propose a last-level cache design, based on the D-NUCA scheme, which is able to significantly limit its static power consumption by dynamically adapting to the needs of the running application: the way adaptable D-NUCA cache. This design leads to a fast and power-efficient memory hierarchy with an average reduction of 31.2% in energy-delay product (EDP) with respect to a traditional D-NUCA. We propose and discuss a methodology for tuning the intrinsic parameters of our design and investigate the adoption of the way adaptable D-NUCA scheme as a shared L2 cache in a chip multiprocessor (CMP) system (24% reduction in EDP).
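
For readers unfamiliar with the metric, the energy-delay product is energy multiplied by execution time, so a design can lose a little performance and still improve EDP if it saves enough energy. The sketch below shows the arithmetic. The function names and the input ratios are hypothetical and are not figures from the paper.

```python
def edp(energy_j, delay_s):
    """Energy-delay product: lower is better."""
    return energy_j * delay_s


def edp_reduction(energy_ratio, delay_ratio):
    """Fractional EDP reduction of a new design versus a baseline.

    energy_ratio and delay_ratio are new / baseline, so 0.70 means the
    new design uses 70% of the baseline energy.
    """
    return 1.0 - energy_ratio * delay_ratio


# Hypothetical example: 30% less energy, 3% longer execution time.
reduction = edp_reduction(energy_ratio=0.70, delay_ratio=1.03)
print(f"EDP reduction: {reduction:.1%}")
```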

digital systems design | 2008

Leveraging Data Promotion for Low Power D-NUCA Caches

Alessandro Bardine; Manuel Comparetti; Pierfrancesco Foglia; Giacomo Gabrielli; Cosimo Antonio Prete; Per Stenström

D-NUCA caches are cache memories that, thanks to their banked organization, broadcast search and promotion/demotion mechanism, are able to tolerate the increasing wire-delay effects introduced by technology scaling. As a consequence, they will outperform conventional caches (UCA, Uniform Cache Architectures) in future-generation cores. Because of the promotion/demotion mechanism, we observed that the distribution of hits across the ways of a D-NUCA cache varies across applications as well as across different execution phases within a single application. In this work, we show how this behavior can be leveraged to improve D-NUCA power efficiency as well as to decrease its access latency. In particular, we propose: 1) a new micro-architectural technique to reduce the static power consumption of a D-NUCA cache by dynamically adapting the number of active (i.e., powered-on) ways to the needs of the running application; our evaluation shows that a strong reduction in the average number of active ways (37.1%) is achievable without significantly affecting the IPC (-2.83%), leading to a reduction of the Energy Delay Product (EDP) of 30.9%; 2) a strategy to estimate the characteristic parameters of the proposed technique; 3) an evaluation of the effectiveness of the proposed technique in a multicore environment.
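
The idea of adapting the number of powered-on ways to the observed hit distribution can be sketched as follows. This is only an illustrative policy under stated assumptions: the decision rule, the thresholds `low` and `high`, and the function name are hypothetical, and the paper's actual technique and parameters may differ.

```python
def adapt_active_ways(hits_per_way, active_ways, total_ways,
                      low=0.005, high=0.02, min_ways=1):
    """Return the number of ways to keep powered on for the next interval.

    hits_per_way[0] is the way nearest to the controller. Promotion tends
    to concentrate hits in the nearest ways, so a small share of hits in
    the farthest active way suggests that way can be switched off.
    """
    active_hits = hits_per_way[:active_ways]
    total = sum(active_hits)
    if total == 0:
        return active_ways  # no information in this interval: keep state
    far_share = active_hits[-1] / total
    if far_share < low and active_ways > min_ways:
        return active_ways - 1  # farthest way is barely used: power it off
    if far_share > high and active_ways < total_ways:
        return active_ways + 1  # farthest way is busy: power on one more
    return active_ways


# Hypothetical per-interval hit counters for an 8-way cache, 4 ways active.
print(adapt_active_ways([900, 80, 19, 1, 0, 0, 0, 0], 4, 8))
print(adapt_active_ways([400, 300, 200, 100, 0, 0, 0, 0], 4, 8))
```

The gap between the two thresholds gives the policy some hysteresis, so the number of active ways does not oscillate when the hit share sits near a single cut-off.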

design, automation, and test in europe | 2009

A power-efficient migration mechanism for D-NUCA caches

Alessandro Bardine; Manuel Comparetti; Pierfrancesco Foglia; Giacomo Gabrielli; Cosimo Antonio Prete

D-NUCA L2 caches are able to tolerate the increasing wire-delay effects due to technology scaling thanks to their banked organization, broadcast line search and data promotion/demotion mechanism. The data promotion mechanism aims to move frequently accessed data near the core, but it causes additional accesses to cache banks, hence increasing dynamic energy consumption. We show how, in some cases, this migration mechanism is not successful in reducing data access latency and can be selectively and dynamically inhibited, thus reducing dynamic energy consumption without affecting performance.
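
A toy model helps show why promotion costs extra bank accesses and what inhibiting it saves. The sketch assumes a single bank set holding one line per bank and a promotion policy that swaps a hit line with the line in the next bank closer to the controller. The function name, the `promote` flag and the cost of two extra bank writes per swap are illustrative assumptions, not the mechanism evaluated in the paper.

```python
def access(bank_set, tag, promote=True):
    """Look up tag in a bank set; optionally promote it by one bank.

    bank_set is a list of tags, index 0 being the bank nearest to the
    controller. Returns (hit_bank_index or None, extra_bank_writes).
    """
    if tag not in bank_set:
        return None, 0  # miss: no promotion takes place
    i = bank_set.index(tag)
    if promote and i > 0:
        # Swap with the line in the next closer bank: the hit line is
        # promoted, the other line is demoted. Both banks are written.
        bank_set[i - 1], bank_set[i] = bank_set[i], bank_set[i - 1]
        return i, 2
    return i, 0


banks = ["A", "B", "C", "D"]
print(access(banks, "C"), banks)                 # promoted one bank closer
print(access(banks, "C", promote=False), banks)  # promotion inhibited
```

With `promote=False` the hit is served from the same bank and no extra writes occur, which is the dynamic energy saving the abstract refers to. Deciding when inhibition does not hurt latency is the part this sketch leaves out.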

IEEE Transactions on Very Large Scale Integration Systems | 2014

Evaluation of Leakage Reduction Alternatives for Deep Submicron Dynamic Nonuniform Cache Architecture Caches

Alessandro Bardine; Manuel Comparetti; Pierfrancesco Foglia; Cosimo Antonio Prete

Wire delays and leakage energy consumption are both growing problems in designing large on-chip caches. Nonuniform cache architecture (NUCA) is a wire-delay aware design paradigm based on the sub-banking of a cache, which allows the banks closer to the controller to be accessed with reduced latencies with respect to the other banks. This feature is leveraged by dynamic NUCA (D-NUCA) caches via a migration mechanism which speeds up frequently used data access, further reducing the effect wire delays have on performance. To reduce leakage power consumption of static random access memory caches, various micro-architectural techniques have been proposed. In this brief, we compare the benefits and limits of the application of some of these techniques to a D-NUCA cache memory, and propose a novel hybrid scheme based on the Drowsy and Way Adaptable techniques. Such a scheme allows further improvement in leakage reduction and limits the impact of process variation on the effectiveness of the Drowsy technique.

symposium on computer architecture and high performance computing | 2009

Analysis of Performance Dependencies in NUCA-Based CMP Systems

Pierfrancesco Foglia; Francesco Panicucci; Cosimo Antonio Prete; Marco Solinas

Improvements in semiconductor nanotechnology have continuously provided a growing number of faster and smaller transistors per chip. Consequently, classical techniques for boosting performance, such as increasing the clock frequency and the amount of work performed in each clock cycle, can no longer deliver significant improvements because of energy constraints and wire-delay effects. As a result, designers' interest has shifted toward systems with multiple cores per chip (Chip Multiprocessors, CMP). CMP systems typically adopt a large last-level cache (LLC) shared among all cores, and private L1 caches. As the miss resolution time for private caches depends on the response time of the LLC, which is wire-delay dominated, performance is affected by wire delay. NUCA caches have been proposed for single- and multi-core systems as a mechanism for tolerating such wire-delay effects on overall performance. In this paper, we introduce our design for S-NUCA and D-NUCA cache memory systems, and we present an analysis of an 8-CPU CMP system with two levels of cache, in which the L1s are private while the L2 is a NUCA shared among all cores. We considered two system topologies (the first, 8p, with the eight CPUs connected to the NUCA on the same side; the second, 4+4p, with half of the CPUs on one side and the others on the opposite side), and for all the configurations we evaluate the effectiveness of both the static and dynamic policies that have been proposed. Our results show that adopting a D-NUCA scheme with the 8p configuration is the best-performing solution among all the considered configurations, and that for the 4+4p configuration the D-NUCA outperforms the S-NUCA in most cases. We highlight that performance is tied to both mapping strategy variations (Static and Dynamic) and topology changes. We also observe that bandwidth occupancy depends on both the NUCA policy and the topology.

ACM Transactions on Embedded Computing Systems | 2014

Exploiting replication to improve performances of NUCA-based CMP systems

Pierfrancesco Foglia; Marco Solinas

Improvements in semiconductor nanotechnology have made chip multiprocessors (CMPs) the reference architecture for high-performance microprocessors. CMPs usually adopt large Last-Level Caches (LLC) shared among cores and private L1 caches, whose performance depends on the wire-delay dominated response time of the LLC. NUCA (Non-Uniform Cache Architecture) caches represent a viable solution for tolerating wire-delay effects. In this article, we present Re-NUCA, a NUCA cache that exploits replication of blocks inside the LLC to avoid the performance limitations of D-NUCA caches due to conflicting accesses to shared data. Results show that a Re-NUCA LLC improves performance by more than 5% on average, and by up to 15% for applications that suffer strongly from conflicting accesses to shared data, while reducing network traffic and power consumption with respect to D-NUCA caches. In addition, it outperforms different S-NUCA schemes optimized with victim replication.

symposium on computer architecture and high performance computing | 2010

Feedback-Driven Restructuring of Multi-threaded Applications for NUCA Cache Performance in CMPs

Sandro Bartolini; Pierfrancesco Foglia; Marco Solinas; Cosimo Antonio Prete

This paper addresses feedback-directed restructuring techniques tuned to Non-Uniform Cache Architectures (NUCA) in CMPs running multi-threaded applications. Access time to NUCA caches depends on the location of the referenced block, so the locality and cache mapping of the application influence the overall performance. We show techniques for altering the distribution of applications in the cache space so as to achieve improved average memory access time. In CMPs running multi-threaded applications, the aggregated accesses (and locality) of the processors form the actual cache load and pose specific issues. We consider a number of Splash-2 and Parsec benchmarks on an 8-processor system and show that a relatively simple remapping algorithm is able to improve the average Static-NUCA (SNUCA) cache access time by 5.5% and allows an SNUCA cache to surpass the performance of a more complex Dynamic-NUCA (DNUCA) cache for most benchmarks. Then, we present a more sophisticated remapping algorithm, relying on cache geometry information and on the access distribution statistics from individual processors, that reduces the average cache access time by 10.2% and is very stable across all benchmarks.
