
Publication


Featured research published by Rosario Cammarota.


International Symposium on Microarchitecture | 2012

Improving Cache Management Policies Using Dynamic Reuse Distances

Nam Duong; Dali Zhao; Taesu Kim; Rosario Cammarota; Mateo Valero; Alexander V. Veidenbaum

Cache management policies such as replacement, bypass, or shared cache partitioning have been relying on data reuse behavior to predict the future. This paper proposes a new way to use dynamic reuse distances to further improve such policies. A new replacement policy is proposed which prevents replacing a cache line until a certain number of accesses to its cache set, called a Protecting Distance (PD). The policy protects a cache line long enough for it to be reused, but not beyond that to avoid cache pollution. This can be combined with a bypass mechanism that also relies on dynamic reuse analysis to bypass lines with less expected reuse. A miss fetch is bypassed if there are no unprotected lines. A hit rate model based on dynamic reuse history is proposed and the PD that maximizes the hit rate is dynamically computed. The PD is recomputed periodically to track a program's memory access behavior and phases. Next, a new multi-core cache partitioning policy is proposed using the concept of protection. It manages lifetimes of lines from different cores (threads) in such a way that the overall hit rate is maximized. The average per-thread lifetime is reduced by decreasing the thread's PD. The single-core PD-based replacement policy with bypass achieves an average speedup of 4.2% over the DIP policy, while the average speedups over DIP are 1.5% for dynamic RRIP (DRRIP) and 1.6% for sampling dead-block prediction (SDP). The 16-core PD-based partitioning policy improves the average weighted IPC by 5.2%, throughput by 6.4% and fairness by 9.9% over thread-aware DRRIP (TA-DRRIP). The required hardware is evaluated and the overhead is shown to be manageable.
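The protection mechanism described above can be sketched as a small set-level model. This is a simplified illustration, not the paper's hardware design: the `PDSet` class is hypothetical, and it uses a fixed PD, whereas the paper recomputes the PD dynamically from a hit-rate model.

```python
# Sketch of Protecting-Distance (PD) based replacement and bypass for one
# cache set. Assumption: a fixed PD, measured in accesses to the set.
class PDSet:
    def __init__(self, ways, pd):
        self.ways = ways        # set associativity
        self.pd = pd            # protecting distance, in set accesses
        self.lines = {}         # tag -> remaining protection counter

    def access(self, tag):
        """Returns 'hit', 'replace', or 'bypass'."""
        # Every access to the set ages all resident lines by one.
        for t in self.lines:
            if self.lines[t] > 0:
                self.lines[t] -= 1
        if tag in self.lines:
            self.lines[tag] = self.pd       # a reused line is re-protected
            return 'hit'
        unprotected = [t for t, c in self.lines.items() if c == 0]
        if len(self.lines) < self.ways:     # free way available
            self.lines[tag] = self.pd
            return 'replace'
        if unprotected:                     # evict an unprotected victim
            del self.lines[unprotected[0]]
            self.lines[tag] = self.pd
            return 'replace'
        return 'bypass'  # all lines still protected: the miss fetch bypasses
```

With `ways=2, pd=3`, a line is protected for three set accesses after insertion or reuse; a miss that finds only protected lines is bypassed rather than polluting the set.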


Computing Frontiers | 2011

Pruning hardware evaluation space via correlation-driven application similarity analysis

Rosario Cammarota; Arun Kejariwal; Paolo D'Alberto; Sapan Panigrahi; Alexander V. Veidenbaum; Alexandru Nicolau

System evaluation is routinely performed in industry to select one amongst a set of different systems to improve performance of proprietary applications. However, a wide range of system configurations is available every year on the market. This makes an exhaustive system evaluation progressively challenging and expensive. In this paper we propose a novel similarity-based methodology for system selection. Our methodology prunes the set of candidate systems by eliminating those systems that are likely to reduce performance of a given proprietary application. The pruning process relies on applications that are similar to a given application of interest whose performance on the candidate systems is known. This obviates the need to install and run the given application on each and every candidate system. The concept of similarity we introduce is performance centric. For a given application, we compute Pearson's correlation between different types of resource stall and cycles per instruction. We refer to the vector of Pearson's correlation coefficients as an application signature. Next, we assess similarity between two applications as Spearman's correlation between their respective signatures. We use the former type of correlation to quantify the association between pipeline stalls and cycles per instruction, whereas we use the latter type of correlation to quantify the association of two signatures, hence to assess similarity, based on the difference in terms of rank ordering of their components. We evaluate the proposed methodology on three different micro-architectures, viz., Intel's Harpertown, Nehalem and Westmere, using industry-standard SPEC CINT2006. We assess performance centric similarity among applications in SPEC CINT2006. We show how our methodology clusters applications with common performance issues. Finally, we show how to use the notion of similarity among applications to compare the three architectures with respect to a given Yahoo! property.
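The signature and similarity computations can be sketched in a few lines. This is a hedged illustration: the function names are hypothetical, the rank transform assumes no ties for brevity, and the paper's exact set of stall-event counters is not reproduced here.

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient of two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(v):
    # Rank transform (assumes no ties, for brevity).
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order):
        r[i] = rank + 1
    return r

def spearman(x, y):
    # Spearman correlation = Pearson correlation of the ranks.
    return pearson(ranks(x), ranks(y))

def signature(stall_series, cpi_series):
    # One Pearson coefficient per stall type, each against the CPI series.
    return [pearson(s, cpi_series) for s in stall_series]

def similarity(sig_a, sig_b):
    # Similarity of two applications: Spearman correlation of signatures,
    # i.e., agreement in the rank ordering of their components.
    return spearman(sig_a, sig_b)
```

A stall type that rises and falls with CPI gets a coefficient near 1 in the signature; two applications whose signatures rank stall types the same way come out as highly similar.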


Embedded Systems for Real-Time Multimedia | 2015

WebRTCbench: a benchmark for performance assessment of WebRTC implementations

Sajjad Taheri; Laleh Aghababaie Beni; Alexander V. Veidenbaum; Alexandru Nicolau; Rosario Cammarota; Jianlin Qiu; Qiang Lu; Mohammad R. Haghighat

WebRTC is an HTML5 API that allows browsers to establish a peer-to-peer connection for transferring data and media content via JavaScript APIs. This functionality enables a broad range of new applications to emerge and is going to revolutionize Web communication. However, this technology is still undergoing development and standardization. Hence, detecting performance bottlenecks of different implementations across operating systems and architectures can help improve it significantly, and a benchmark suite would be a great help to accomplish this task. In this paper, we present WebRTCBench, a benchmark which measures WebRTC peer connection establishment and communication performance. We present and discuss performance evaluation of WebRTC implementations across a range of implementations and devices. This benchmark is publicly available under the GPL license.


Compiler Construction | 2013

On the determination of inlining vectors for program optimization

Rosario Cammarota; Alexandru Nicolau; Alexander V. Veidenbaum; Arun Kejariwal; Debora Donato; Mukund Madhugiri

In this paper we propose a new technique and a framework to select inlining heuristic constraints - referred to as an inlining vector - for program optimization. The proposed technique uses machine learning to model the correspondence between inlining vectors and performance (completion time). The automatic selection of a machine learning algorithm to build such a model is part of our technique and we present a rigorous selection procedure. Subject to a given architecture, such a model evaluates the benefit of inlining combined with other global optimizations and selects an inlining vector that, within the limits of the model, minimizes the completion time of a program. We conducted our experiments using the GNU GCC compiler and optimized 22 (program, input) combinations from SPEC CINT2006 on the state-of-the-art Intel Xeon Westmere architecture. Compared with the -O3 optimization level, our technique yields performance improvements ranging from 2% to 9%.
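The selection step can be sketched as fitting a simple predictor over profiled (inlining vector, completion time) samples and taking the argmin over candidate vectors. This is a toy illustration with an assumed 1-nearest-neighbor model; the paper selects a machine learning algorithm through a rigorous procedure rather than fixing one.

```python
def train(samples):
    """samples: list of (inlining_vector, completion_time) pairs obtained
    from a small number of profiled runs. Returns a 1-nearest-neighbor
    completion-time predictor (an assumed stand-in for the learned model)."""
    def predict(vec):
        def dist(sample):
            v, _ = sample
            return sum((a - b) ** 2 for a, b in zip(v, vec))
        return min(samples, key=dist)[1]
    return predict

def best_vector(candidates, predict):
    # Select the inlining vector with the smallest predicted completion time.
    return min(candidates, key=predict)
```

Here an inlining vector is modeled as a tuple of numeric heuristic constraints (e.g., size limits); the model picks whichever candidate the predictor scores fastest.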


IEEE International Conference on High Performance Computing, Data, and Analytics | 2012

A fault tolerant self-scheduling scheme for parallel loops on shared memory systems

Yizhuo Wang; Alexandru Nicolau; Rosario Cammarota; Alexander V. Veidenbaum

As the number of cores per chip increases, significant speedup for many applications could be achieved by exploiting loop level parallelism (LLP). Meanwhile, ever-scaling device sizes make multicore/multiprocessor systems suffer from increased reliability problems. The scheduling scheme plays a key role in exploiting LLP. Among existing dynamic loop scheduling schemes, self-scheduling is the most commonly used. This paper presents FTSS, a fault tolerant self-scheduling scheme which aims to execute parallel loops efficiently in the presence of hardware faults on shared memory systems. Our technique transforms a loop to ensure the correctness of the re-execution of loop iterations by buffering variables with anti-dependences, which makes it possible to design a fault tolerant loop scheduling scheme without checkpointing. FTSS combines work-stealing with self-scheduling, and uses a bidirectional execution model when work is stolen from a faulty core. Experimental results show that FTSS achieves better load balancing than existing self-scheduling schemes. Compared with checkpoint/restart implementations that save a checkpoint before executing each chunk of iterations and restart the whole chunk running on a faulty core, FTSS exhibits better runtime performance. In addition, FTSS greatly outperforms existing self-scheduling schemes in terms of performance and stability in heavily loaded runtime environments.
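The combination of self-scheduling with re-execution of chunks abandoned by a faulty core can be sketched as follows. This is a simplified model, not the paper's scheme: the fault is simulated, the per-chunk reduction is idempotent so the anti-dependence buffering the paper describes is not needed here, and the bidirectional execution model is omitted.

```python
import threading

def ftss_sum(n, workers=4, chunk=8, faulty=()):
    """Sum 0..n-1 with self-scheduled chunks; workers listed in `faulty`
    abandon their first chunk, simulating a core failing mid-loop."""
    lock = threading.Lock()
    state = {'next': 0, 'total': 0}
    failed = []  # chunks abandoned by faulty workers

    def worker(wid):
        while True:
            with lock:                      # self-scheduling: grab next chunk
                lo = state['next']
                if lo >= n:
                    return
                hi = min(lo + chunk, n)
                state['next'] = hi
            if wid in faulty:               # simulated hardware fault
                with lock:
                    failed.append((lo, hi))
                return
            s = sum(range(lo, hi))          # execute the chunk
            with lock:
                state['total'] += s

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Healthy workers "steal" and re-execute abandoned chunks. No checkpoint
    # is needed: a failed chunk never contributed to the running total.
    for lo, hi in failed:
        state['total'] += sum(range(lo, hi))
    return state['total']
```

The result is correct regardless of which worker fails, because chunk boundaries fully describe the work to redo.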


International Symposium on Parallel and Distributed Computing | 2013

Effective Evaluation of Multi-core Based Systems

Rosario Cammarota; Laleh Aghababaie Beni; Alexandru Nicolau; Alexander V. Veidenbaum

This work proposes a practical technique to reduce the evaluation cost of multi-core based systems, when these systems are evaluated with parallel benchmarks. The proposed technique highlights the amount of redundancy in a set of parallel benchmarks and reduces this set to a subset of benchmarks such that: (i) the selected benchmarks are representative or non-redundant - i.e., the series of performance attained by any pair of representative benchmarks on different systems significantly differ; (ii) system evaluation is executed efficiently - i.e., on the system under evaluation, the average performance of representative benchmarks closely approaches the average performance of the whole suite. The proposed technique is validated with the industry-standard benchmark suites SPEC OMP2001 and SPEC OMP2012 on the largest data set of systems publicly available on the SPEC website as of the last quarter of 2012. For each suite, the proposed technique (i) identifies a subset of representative benchmarks and (ii) shows how this subset of representative benchmarks - ≈ 50% of the total number of benchmarks - can be deployed to evaluate multi-core based systems with a prediction error < 5% at a 99% confidence level.
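The pruning idea can be sketched as a greedy filter over per-benchmark performance series measured across systems. The greedy criterion and threshold here are assumptions for illustration; the paper's statistical procedure and validation are more involved.

```python
from math import sqrt

def pearson(x, y):
    # Pearson correlation coefficient of two equal-length series.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def representative_subset(perf, threshold=0.95):
    """perf: dict mapping benchmark name -> performance series, one score
    per system. Greedily keep a benchmark only if its series is not strongly
    correlated with any already-kept benchmark (i.e., it is non-redundant)."""
    kept = []
    for name in sorted(perf):
        if all(abs(pearson(perf[name], perf[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

A benchmark whose cross-system performance series merely scales another benchmark's series adds no discriminating power, so it is dropped.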


Computing Frontiers | 2012

Selective search of inlining vectors for program optimization

Rosario Cammarota; Arun Kejariwal; Debora Donato; Alexandru Nicolau; Alexander V. Veidenbaum

We propose a novel technique to select the inlining options of a compiler - referred to as an inlining vector - for program optimization. The proposed technique trains a machine learning algorithm to model the relation between inlining vectors and performance (completion time). The training set is composed of sample runs of the programs to optimize, compiled with a limited number of inlining vectors. Subject to a given compiler, the model evaluates the benefit of inlining combined with other compiler heuristics. The model is subsequently used to select the inlining vector which minimizes the predicted completion time of a program with respect to a given level of optimization. We present a case study based on the GNU GCC compiler. We used our technique to improve the performance of 403.gcc from SPEC CPU2006 - a program which is notoriously hard to optimize - with respect to the optimization level -O3 as the baseline. On the state-of-the-art Intel Xeon Westmere architecture, 403.gcc, compiled using the inlining vectors selected by our technique, outperforms the baseline by up to 9%.


Design Automation Conference | 2018

Protecting the supply chain for automotives and IoTs

Sandip Ray; Wen Chen; Rosario Cammarota

Modern automotive systems and IoT devices are designed through a highly complex, globalized, and potentially untrustworthy supply chain. Each player in this supply chain may (1) introduce sensitive information and data (collectively termed “assets”) that must be protected from other players in the supply chain, and (2) have controlled access to assets introduced by other players. Furthermore, some players in the supply chain may be malicious. It is imperative to protect the device and any sensitive assets in it from being compromised or unknowingly disclosed by such entities. A key – and sometimes overlooked – component of the security architecture of modern electronic systems entails managing security in the face of supply chain challenges. In this paper we discuss some security challenges in automotive and IoT systems arising from supply chain complexity, and the state of the practice in this area.


Computing Frontiers | 2018

VPsec: countering fault attacks in general purpose microprocessors with value prediction

Rosario Cammarota; Rami Sheikh

Despite their complexity, general purpose microprocessors are susceptible to fault attacks. The state-of-the-art fault attacks rely on a precise understanding of the microprocessor datapath and the instructions' critical path, to identify the exact time and location for injecting data faults that affect only targeted instructions in the pipeline. Software-only mitigations are only partially effective in defending against such attacks, whereas existing hardware-assisted mitigations require substantial changes to the microprocessor design. Both types of mitigation introduce significant overheads to the application memory footprint, the microprocessor area, or impact the overall system performance. We propose a novel hardware-only scheme: Value Prediction for Security (VPsec). VPsec leverages value prediction in an embodiment and system design to mitigate fault attacks in general purpose microprocessors. Value prediction is an elegant and by now mature microarchitectural performance optimization, which aims to predict a data value ahead of its production with high prediction accuracy and coverage. VPsec leverages the presence of state-of-the-art value prediction in a general purpose microprocessor, and re-architects it for security. It augments the original value prediction embodiment with fault detection logic and reaction logic to mitigate fault attacks on both the datapath and the value predictor itself. VPsec defines a new mode of execution in which the predicted value is trusted rather than the produced value. From a design perspective, VPsec requires minimal hardware changes (negligible area impact) with respect to a baseline that supports value prediction, it has no software overheads (no increase in memory footprint), and it retains most of the performance benefits of value prediction. Our evaluation of VPsec demonstrates its efficacy in countering fault attacks as well as its ability to retain the performance benefits of value prediction on cryptographic and non-cryptographic workloads.
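The detection/reaction idea can be sketched as a behavioral model built around a simple last-value predictor. This is an assumed software model, not the hardware embodiment: the confidence counter, threshold, and table layout are illustrative only.

```python
# Behavioral sketch of VPsec's core idea: when the value predictor is highly
# confident and the produced value disagrees, treat the mismatch as an
# injected fault and trust the predicted value instead of the produced one.
class VPsec:
    def __init__(self, conf_threshold=3):
        self.table = {}                     # pc -> (last value, confidence)
        self.conf_threshold = conf_threshold
        self.faults = 0                     # detected fault count

    def commit(self, pc, produced):
        value, conf = self.table.get(pc, (produced, 0))
        if conf >= self.conf_threshold and produced != value:
            self.faults += 1                # reaction: flag, trust prediction
            result = value
        else:
            result = produced
        # Last-value predictor update: reinforce on match, reset on change.
        if result == value:
            self.table[pc] = (value, conf + 1)
        else:
            self.table[pc] = (result, 0)
        return result
```

After a few agreeing commits at one program counter, a suddenly divergent produced value is overridden by the prediction and counted as a fault; low-confidence entries still follow the produced value, so ordinary value changes are not misflagged.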


Cryptography | 2018

Improving Performance and Mitigating Fault Attacks Using Value Prediction

Rami Sheikh; Rosario Cammarota

We present Value Prediction for Security (VPsec), a novel hardware-only framework to counter fault attacks in modern microprocessors, while preserving the performance benefits of Value Prediction (VP). VP is an elegant and by now mature microarchitectural performance optimization, which aims to predict a data value ahead of its production with high prediction accuracy and coverage. Instances of VPsec leverage state-of-the-art value predictors in an embodiment and system design to mitigate fault attacks in modern microprocessors. Specifically, VPsec implementations re-architect any baseline VP embodiment with fault detection logic and reaction logic to mitigate fault attacks on both the datapath and the value predictor itself. VPsec also defines a new mode of execution in which the predicted value is trusted rather than the produced value. From a microarchitectural design perspective, VPsec requires minimal hardware changes (negligible area and complexity impact) with respect to a baseline that supports VP, it has no software overheads (no increase in memory footprint or execution time), and it retains most of the performance benefits of VP under realistic attacks. Our evaluation of VPsec demonstrates its efficacy in countering fault attacks, as well as its ability to retain the performance benefits of VP on cryptographic workloads, such as OpenSSL, and non-cryptographic workloads, such as SPEC CPU 2006/2017.

Collaboration


Dive into Rosario Cammarota's collaborations.

Top Co-Authors
Nikil D. Dutt

University of California
