Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Sandro Bartolini is active.

Publication


Featured research published by Sandro Bartolini.


ACM Journal on Emerging Technologies in Computing Systems | 2014

Design Options for Optical Ring Interconnect in Future Client Devices

Paolo Grani; Sandro Bartolini

Nanophotonics is a promising solution for on-chip interconnection due to its intrinsic low-latency and low-power features. Future tiled chip multiprocessors (CMPs) for rich client devices can draw energy benefits from this technology, but we show that great care has to be taken in integrating the various facets involved in order to avoid queuing and serialization issues and obtain the expected advantages. We evaluate different management strategies for accessing a simple, shared photonic path (ring), working in conjunction with a standard electronic mesh or alone, in a tiled CMP. Our results highlight that a careful selection of the most latency-critical messages to be routed in photonics, and the use of a conflict-free access scheme, are crucial for obtaining performance/power advantages when the available bandwidth is limited. We identify the design point where all the traffic can be routed on the photonic path and the electronic network can thus be suppressed. At this point, the ring achieves a 20--25% speedup and an 84% energy consumption improvement over the electronic baseline. We then investigate the same trade-offs when the number of rings is increased up to eight, raising the performance benefit to up to 40% or the energy reduction to up to 80%. Finally, we explore the effects of splitting a given optical parallelism across a larger number of waveguides to further improve energy savings.
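
How a conflict-free access scheme on a shared photonic ring might work can be pictured with a simple round-robin token arbiter; the specific arbitration evaluated in the paper is not reproduced here, so the tile count and message handling below are illustrative assumptions.

```python
# Illustrative sketch (not the paper's implementation): round-robin token
# arbitration on a single shared photonic ring. Only the token holder may
# transmit in a given slot, which keeps access conflict-free.
from collections import deque

class TokenRing:
    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        self.queues = [deque() for _ in range(num_tiles)]  # per-tile pending messages
        self.token = 0  # tile currently allowed to transmit

    def enqueue(self, tile, message):
        self.queues[tile].append(message)

    def cycle(self):
        """Advance one arbitration slot; return the message sent, if any."""
        sent = None
        if self.queues[self.token]:
            sent = self.queues[self.token].popleft()
        # pass the token to the next tile (round-robin, conflict-free)
        self.token = (self.token + 1) % self.num_tiles
        return sent

ring = TokenRing(num_tiles=16)
ring.enqueue(3, "coherence request from tile 3")
print(ring.cycle(), ring.cycle(), ring.cycle(), ring.cycle())
```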


Design, Automation and Test in Europe | 2013

Contrasting wavelength-routed optical NoC topologies for power-efficient 3D-stacked multicore processors using physical-layer analysis

Luca Ramini; Paolo Grani; Sandro Bartolini; Davide Bertozzi

Optical networks-on-chip (ONoCs) are currently still at the concept stage and would benefit from explorative studies capable of bridging the gap between abstract analysis frameworks and the constraints and challenges posed by the physical layer. This paper aims to go beyond the traditional comparison of wavelength-routed ONoC topologies based only on their abstract properties and, for the first time, assesses their physical implementation efficiency in a homogeneous experimental setting of practical relevance. As a result, the paper demonstrates the significant and differing deviations of topology layouts from their logic schemes under the effect of placement constraints on the target system. This then becomes the preliminary step for the accurate characterization of technology-specific metrics, such as the insertion-loss critical path, and for deriving the ultimate impact on the power efficiency and feasibility of each design.
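
The insertion-loss critical path mentioned above is essentially a worst-case sum of per-element losses along every source-destination path; a minimal sketch of that accounting follows, with placeholder loss coefficients that are assumptions rather than values from the paper.

```python
# Illustrative sketch: insertion-loss critical path of a wavelength-routed
# ONoC, modeled as the worst-case sum of per-element losses along any
# source->destination path. Loss coefficients are placeholders, not
# values from the paper.
LOSS_DB = {
    "propagation_per_cm": 1.0,   # waveguide propagation loss (dB/cm), assumed
    "crossing": 0.15,            # per waveguide crossing (dB), assumed
    "ring_through": 0.02,        # per ring passed in the through state (dB), assumed
    "ring_drop": 0.5,            # per ring in the drop state (dB), assumed
}

def path_loss(length_cm, crossings, rings_through, rings_drop):
    """Total insertion loss (dB) of one source->destination path."""
    return (length_cm * LOSS_DB["propagation_per_cm"]
            + crossings * LOSS_DB["crossing"]
            + rings_through * LOSS_DB["ring_through"]
            + rings_drop * LOSS_DB["ring_drop"])

# Each tuple: (length_cm, crossings, rings passed through, rings dropped at)
paths = [(1.2, 8, 10, 1), (0.8, 14, 6, 1), (2.0, 3, 16, 1)]
critical = max(path_loss(*p) for p in paths)
print(f"insertion-loss critical path: {critical:.2f} dB")  # drives the laser power budget
```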


IEEE Transactions on Computers | 2008

Effects of Instruction-Set Extensions on an Embedded Processor: A Case Study on Elliptic Curve Cryptography over GF(2^m)

Sandro Bartolini; Irina Branovic; Roberto Giorgi; Enrico Martinelli

Elliptic-curve cryptography (ECC) is promising for enabling information security in constrained embedded devices. In order to be efficient on a target architecture, ECC requires an accurate choice and tuning of the algorithms that perform the underlying mathematical operations. This paper contributes a cycle-level analysis of how ECC performance depends on the interaction between the features of the mathematical algorithms and the actual architectural and microarchitectural features of an ARM-based Intel XScale processor. Another contribution is the cycle-level analysis of a modified ARM processor that includes a word-level finite-field polynomial multiplier (poly_mul) in its data path. This extension constitutes a good trade-off between applicability in a number of contexts, simplicity of integration within the processor, and performance. This paper points out the most advantageous mix of elliptic curve (EC) parameters both for the standard ARM-based Intel XScale platform and for the one equipped with the poly_mul unit. In particular, the latter case allows for more than a 41 percent execution-time reduction on the considered benchmarks. Last, this paper investigates the correlation between the possible architectural organizations of a processor equipped with poly_mul unit(s) and EC benchmark performance. For instance, only superscalar pipelines can exploit the features of out-of-order execution, and only very complex organizations (for example, four-way superscalar) can exploit a high number of available ALUs. Conversely, we show that there are no benefits in endowing the processor with more than one poly_mul, and we point out a possible trade-off between performance and complexity increase: a two-way in-order/out-of-order pipeline achieves +50 percent and +90 percent Instructions per Cycle (IPC), respectively. Finally, we show that there are no critical constraints on the latency and pipelining capability of the poly_mul unit for the basic EC point multiplication.
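
The core operation a poly_mul unit accelerates is word-level carry-less (polynomial) multiplication over GF(2); a minimal software sketch of that operation, not of the paper's hardware design, is shown below.

```python
# Sketch of the operation a word-level poly_mul unit performs in hardware:
# carry-less multiplication of two W-bit polynomials over GF(2). In plain
# software this takes W iterations of shift-and-XOR, which is why a single
# dedicated instruction cuts the dynamic instruction count so sharply.
def clmul(a: int, b: int, width: int = 32) -> int:
    """Carry-less (polynomial) multiply of two width-bit operands."""
    result = 0
    for i in range(width):
        if (b >> i) & 1:          # if bit i of b is set ...
            result ^= a << i      # ... XOR in a shifted copy of a
    return result                  # result is up to 2*width-1 bits long

# Example: (x^3 + x + 1) * (x + 1) = x^4 + x^3 + x^2 + 1 over GF(2)
print(hex(clmul(0b1011, 0b0011)))  # -> 0x1d == 0b11101
```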


Symposium on Computer Architecture and High Performance Computing | 2004

A performance evaluation of ARM ISA extension for elliptic curve cryptography over binary finite fields

Sandro Bartolini; Irina Branovic; Roberto Giorgi; Enrico Martinelli

In this paper, we present an evaluation of a possible ARM instruction-set extension for elliptic curve cryptography (ECC) over binary finite fields GF(2^m). The use of elliptic curve cryptography is becoming common in the embedded domain, where its reduced key size, at a security level equivalent to standard public-key methods (such as RSA), allows for power consumption savings and more efficient operation. The ARM processor was selected because it is widely used in embedded system applications. We developed an ECC benchmark set with three widely used public-key algorithms: Diffie-Hellman for key exchange, the Digital Signature Algorithm, and the ElGamal method for encryption/decryption. We analyzed the major bottlenecks at the function level and evaluated the performance improvement when some simple architectural support is introduced in the ARM ISA. Results of our experiments show that the use of a word-level multiplication instruction over the binary field allows for an average 33% reduction of the total number of dynamically executed instructions, while execution time improves by the same amount when projective coordinates are used.
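
The ECC kernels profiled above ultimately reduce to field multiplication in GF(2^m), i.e., a carry-less multiply followed by reduction modulo an irreducible polynomial. A self-contained toy sketch over GF(2^8) (a far smaller field than the standardized curves use, chosen purely for readability) illustrates the arithmetic.

```python
# Toy sketch of the field arithmetic underneath the ECC benchmarks:
# multiplication in GF(2^m) as a carry-less multiply followed by reduction
# modulo an irreducible polynomial. GF(2^8) with x^8 + x^4 + x^3 + x + 1 is
# used only for illustration; real curves use much larger m (e.g., 163, 233).
M = 8
IRRED = 0b1_0001_1011  # x^8 + x^4 + x^3 + x + 1

def gf2m_mul(a: int, b: int) -> int:
    # carry-less multiply (the part a poly_mul instruction would accelerate)
    prod = 0
    for i in range(M):
        if (b >> i) & 1:
            prod ^= a << i
    # reduce modulo the irreducible polynomial, highest degree first
    for i in range(2 * M - 2, M - 1, -1):
        if (prod >> i) & 1:
            prod ^= IRRED << (i - M)
    return prod

print(hex(gf2m_mul(0x57, 0x83)))  # classic GF(2^8) example, expected 0xc1
```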


Design, Automation and Test in Europe | 2014

Assessing the energy break-even point between an optical NoC architecture and an aggressive electronic baseline

Luca Ramini; Paolo Grani; Hervé Tatenguem Fankem; Alberto Ghiribaldi; Sandro Bartolini; Davide Bertozzi

Many cross-benchmarking results reported in the open literature raise optimistic expectations about the use of optical networks-on-chip (ONoCs) for high-performance and low-power on-chip communication. However, most of those previous works ultimately fail to make a compelling case for chip-level nanophotonic NoCs, especially because of the lack of aggressive electronic baselines (ENoCs) and the poor accuracy of physical- and architecture-layer analysis of the ONoC. This paper aims at providing the guidelines and minimum requirements for the emerging nanophotonic technology to become of practical relevance. The key differentiating factor of this work consists of contrasting ONoC solutions with an aggressive ENoC architecture with realistic complexity, performance, and power figures, synthesized on an industrial 40 nm low-power technology. At the same time, key physical design issues and network interface architecture requirements for the ONoC under test are carefully assessed, thus paving the way for a well-grounded definition of the requirements for the emerging ONoC technology to achieve the energy break-even point with respect to purely electronic interconnect solutions in future multi- and many-core systems.
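
A back-of-the-envelope view of such an energy break-even analysis compares the ONoC's constant static power (laser, ring tuning) plus a small per-bit energy against the ENoC's per-bit dynamic energy; the sketch below uses made-up numbers, not figures measured in the paper.

```python
# Rough sketch of an energy break-even comparison between an ONoC and an
# electronic NoC. All numbers are placeholders for illustration: the ONoC
# pays a constant static power plus a small per-bit energy, while the ENoC
# energy is dominated by per-bit dynamic energy in routers and links.
def onoc_energy_joules(bits, duration_s,
                       static_power_w=0.5,      # laser + thermal tuning (assumed)
                       per_bit_j=0.1e-12):      # modulation/detection (assumed)
    return static_power_w * duration_s + per_bit_j * bits

def enoc_energy_joules(bits, per_bit_j=1.0e-12):  # routers + links (assumed)
    return per_bit_j * bits

duration_s = 1e-3  # 1 ms application window
for gbps in (50, 500, 5000):
    bits = gbps * 1e9 * duration_s
    o, e = onoc_energy_joules(bits, duration_s), enoc_energy_joules(bits)
    print(f"{gbps:5d} Gb/s: ONoC {o*1e6:8.1f} uJ vs ENoC {e*1e6:8.1f} uJ")
# The break-even point is the injected bandwidth at which the two curves cross.
```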


Digital Systems Design | 2012

A Simple On-Chip Optical Interconnection for Improving Performance of Coherency Traffic in CMPs

Sandro Bartolini; Paolo Grani

Nanophotonic interconnection is a promising solution for inter-core communication in future chip multiprocessors (CMPs). The main benefits derive from its intrinsic low latency and high bandwidth, especially when employing wavelength division multiplexing (WDM), as well as from its reduced power requirements compared to electronic NoCs. Existing works on optical NoCs (ONoCs) mainly concentrate on relatively complex proposals needed to host the whole CMP traffic. In some proposals, complexity is further increased by the need for an electronic network to perform preliminary path setup in the optical one. This paper proposes to enhance a conventional NoC with only a simple photonic structure, a ring, and investigates its suitability to support the low-latency transmission of small, latency-critical coherency control messages so as to improve the performance of multithreaded applications. In particular, our proposed scheme supports fast multicast transmission of invalidation messages. We have simulated PARSEC benchmarks on an 8-core full-system CMP. Results show that a careful selection of the coherency control messages to be forwarded to the photonic ring improves execution time by up to 19%, with an average of 6% across all considered benchmarks. We discuss how different selections of messages, i.e., those related to read and/or write operations, affect the results, and we single out the most profitable set. Moreover, we show that the sharing behavior of the benchmarks plays a central role in the final performance.
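
The message-selection idea can be pictured as a simple filter in the network interface: small, latency-critical coherency control messages go on the photonic ring, everything else stays on the electronic mesh. The message names and size threshold below are hypothetical, not the protocol classes evaluated in the paper.

```python
# Illustrative sketch (message names are hypothetical): route small,
# latency-critical coherence control messages on the photonic ring and keep
# large data transfers on the electronic mesh.
LATENCY_CRITICAL = {"invalidate", "invalidate_ack", "read_request", "upgrade_request"}

def route(message_type: str, payload_bytes: int) -> str:
    """Pick the network for a coherence message."""
    if message_type in LATENCY_CRITICAL and payload_bytes <= 8:
        return "photonic_ring"      # fast multicast path for small control messages
    return "electronic_mesh"        # cache-line data and everything else

print(route("invalidate", 8))        # -> photonic_ring
print(route("data_reply", 64))       # -> electronic_mesh
```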


Design, Automation and Test in Europe | 2008

Instruction cache energy saving through compiler way-placement

Timothy M. Jones; Sandro Bartolini; Bruno De Bus; John Cavazos; Michael F. P. O'Boyle

Fetching instructions from a set-associative cache in an embedded processor can consume a large amount of energy due to the tag checks performed. Recent proposals to address this issue involve predicting or memoizing the correct way to access. However, they also require significant hardware storage which negates much of the energy saving. This paper proposes way-placement to save instruction cache energy. The compiler places the most frequently executed instructions at the start of the binary and at runtime these are mapped to explicit ways within the cache. We compare with a state-of-the-art hardware technique and show that our scheme saves almost 50% of the instruction cache energy compared to 32% for the hardware approach. We report results on a variety of cache sizes and associativities, achieving 59% instruction cache energy savings and an ED product of 0.80 in the best configuration with negligible hardware overhead and no ISA changes.
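
The compiler-side step of way-placement can be pictured as a profile-driven reordering that puts the hottest basic blocks at the start of the binary; the runtime mapping of those addresses to explicit cache ways is hardware behavior not modeled here, and the profile data is invented for illustration.

```python
# Sketch of the compiler-side step described above (the runtime way-mapping
# hardware is not modeled): sort basic blocks by profiled execution count so
# the hottest code lands at the start of the binary, where it can be mapped
# to a dedicated cache way without tag checks.
def way_placement_order(blocks):
    """blocks: list of (block_name, size_bytes, exec_count); returns layout order."""
    return sorted(blocks, key=lambda b: b[2], reverse=True)

profile = [("loop_body", 64, 1_000_000), ("init", 256, 1), ("error_path", 128, 3)]
offset = 0
for name, size, count in way_placement_order(profile):
    print(f"{offset:#06x}: {name} (executed {count}x)")
    offset += size
```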


Symposium on Computer Architecture and High Performance Computing | 2010

Feedback-Driven Restructuring of Multi-threaded Applications for NUCA Cache Performance in CMPs

Sandro Bartolini; Pierfrancesco Foglia; Marco Solinas; Cosimo Antonio Prete

This paper addresses feedback-directed restructuring techniques tuned to Non-Uniform Cache Architectures (NUCA) in CMPs running multi-threaded applications. Access time to NUCA caches depends on the location of the referenced block, so the locality and cache mapping of the application influence the overall performance. We show techniques for altering the distribution of applications in the cache space so as to achieve an improved average memory access time. In CMPs running multi-threaded applications, the aggregated accesses (and locality) of the processors form the actual cache load and pose specific issues. We consider a number of Splash-2 and Parsec benchmarks on an 8-processor system and show that a relatively simple remapping algorithm is able to improve the average Static-NUCA (SNUCA) cache access time by 5.5% and allows an SNUCA cache to surpass the performance of a more complex Dynamic-NUCA (DNUCA) for most benchmarks. We then present a more sophisticated remapping algorithm, relying on cache geometry information and on the access distribution statistics of individual processors, that reduces the average cache access time by 10.2% and is very stable across all benchmarks.
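
In the spirit of the simpler remapping algorithm described above, a greedy sketch maps the most frequently accessed cache regions onto the NUCA banks with the lowest access latency; the bank latencies and access counts below are assumptions for illustration only.

```python
# Greedy SNUCA remapping sketch (illustrative numbers, not from the paper):
# assign the most-accessed cache regions to the banks with the lowest
# average access latency, then report the weighted average access latency.
def remap(region_access_counts, bank_latencies):
    """Hottest region -> fastest bank. Returns {region: bank}."""
    regions = sorted(region_access_counts, key=region_access_counts.get, reverse=True)
    banks = sorted(bank_latencies, key=bank_latencies.get)
    return dict(zip(regions, banks))

accesses = {"R0": 90_000, "R1": 4_000, "R2": 55_000, "R3": 700}
latency = {"B0": 12, "B1": 18, "B2": 25, "B3": 31}   # cycles, assumed
mapping = remap(accesses, latency)
avg = sum(accesses[r] * latency[b] for r, b in mapping.items()) / sum(accesses.values())
print(mapping, f"weighted average access latency = {avg:.1f} cycles")
```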


ACM Transactions on Embedded Computing Systems | 2005

Optimizing instruction cache performance of embedded systems

Sandro Bartolini; Cosimo Antonio Prete

In the embedded domain, the gap between memory and processor performance and the increase in application complexity need to be accommodated without wasting precious system resources: die size, power, etc. For these reasons, effective exploitation of small and simple cache memories is of the utmost importance. However, programs running on such caches can experience serious inefficiencies due to cache conflicts. We present a new Cache-Aware Code Allocation Technique (CAT), which transforms the structure of programs so that their behavior toward memory can meet the locality features the cache is able to exploit. The proposed approach uses detailed information on program execution to place program areas into memory and employs the new idea of "look-forward estimation," which helps to seek better global layouts during the placement of each area. CAT-optimized programs outperform the original ones, achieving the same miss rate on caches two times, and sometimes four times, smaller. Moreover, CAT improves the instruction miss rate by more than 40% compared to the best procedure-reordering algorithm. CAT's performance derives from the increased number of cache lines that support the execution of optimized applications and from a more balanced load on them.
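
A rough sketch of cache-aware code placement in the spirit of CAT follows: each program area is placed at the cache-line offset that minimizes weighted conflicts with areas already placed. The actual look-forward estimation heuristic is richer than this greedy scoring, and all sizes and weights below are invented for illustration.

```python
# Illustrative greedy placement sketch (not the full CAT algorithm): place
# each program area at the cache-line offset that minimizes conflicts with
# the already-placed areas it frequently alternates with in the trace.
NUM_LINES = 16  # direct-mapped toy cache, measured in lines

def place(areas, conflict_weight):
    """areas: [(name, size_in_lines)] in placement order.
    conflict_weight[(a, b)]: how often a and b alternate in the trace."""
    placement, line_users = {}, {i: set() for i in range(NUM_LINES)}
    for name, size in areas:
        best_off, best_cost = 0, None
        for off in range(NUM_LINES):                      # try every starting line
            lines = [(off + k) % NUM_LINES for k in range(size)]
            cost = sum(conflict_weight.get((name, other), 0)
                       for ln in lines for other in line_users[ln])
            if best_cost is None or cost < best_cost:
                best_off, best_cost = off, cost
        placement[name] = best_off
        for k in range(size):
            line_users[(best_off + k) % NUM_LINES].add(name)
    return placement

areas = [("hot_loop", 4), ("callee_a", 3), ("callee_b", 3)]
weights = {("callee_a", "hot_loop"): 500, ("callee_b", "hot_loop"): 400,
           ("callee_b", "callee_a"): 5}
print(place(areas, weights))  # callees avoid the hot loop's cache lines
```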


Collaboration


Dive into Sandro Bartolini's collaborations.
