Oreste Villa | Researchain

Publication

Featured research published by Oreste Villa.

International Parallel and Distributed Processing Symposium | 2010

Dynamic load balancing on single- and multi-GPU systems

Long Chen; Oreste Villa; Sriram Krishnamoorthy; Guang R. Gao

The computational power provided by many-core graphics processing units (GPUs) has been exploited in many applications. The programming techniques currently employed on these GPUs are not sufficient to address problems exhibiting irregular and unbalanced workloads. The problem is exacerbated when trying to effectively exploit multiple GPUs concurrently, which are commonly available in many modern systems. In this paper, we propose a task-based dynamic load-balancing solution for single- and multi-GPU systems. The solution allows load balancing at a finer granularity than what is supported in current GPU programming APIs, such as NVIDIA's CUDA. We evaluate our approach using both micro-benchmarks and a molecular dynamics application that exhibits significant load imbalance. Experimental results with a single-GPU configuration show that our fine-grained task solution can utilize the hardware more efficiently than the CUDA scheduler for unbalanced workloads. On multi-GPU systems, our solution achieves near-linear speedup, load balance, and significant performance improvement over techniques based on standard CUDA APIs.
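
To illustrate why a shared task queue can help with unbalanced workloads, here is a minimal CPU-side sketch in Python. It is not the paper's GPU implementation: the function names, the cost list, and the worker model are all hypothetical, and it only compares the finishing time of a fixed block partition with that of workers pulling tasks from a common queue as they become idle. It assumes a non-empty task list.

```python
import heapq

def static_makespan(costs, workers):
    """Split tasks into contiguous blocks, one block per worker (assumed baseline)."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))

def dynamic_makespan(costs, workers):
    """Each worker pulls the next task from a shared queue as soon as it is idle."""
    free_at = [0] * workers  # min-heap of times at which each worker becomes idle
    for cost in costs:
        start = heapq.heappop(free_at)         # earliest idle worker takes the task
        heapq.heappush(free_at, start + cost)  # it is busy until start + cost
    return max(free_at)

# Hypothetical workload: four expensive tasks followed by four cheap ones.
costs = [8, 8, 8, 8, 1, 1, 1, 1]
print(static_makespan(costs, 2))   # 32: one worker receives all the expensive tasks
print(dynamic_makespan(costs, 2))  # 18: the work is spread evenly
```

In this toy model the dynamic schedule finishes in 18 time units against 32 for the static split, because no worker sits idle while tasks remain in the queue.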

IEEE Transactions on Parallel and Distributed Systems | 2008

Efficient Breadth-First Search on the Cell/BE Processor

Daniele Paolo Scarpazza; Oreste Villa; Fabrizio Petrini

Multi-core processors are a paradigm shift in computer architecture that promises a dramatic increase in performance. But they also bring an unprecedented level of complexity in algorithmic design and software development. In this paper we describe the challenges involved in designing a breadth-first search (BFS) algorithm for the Cell/B.E. processor. The proposed methodology combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee portability with performance to future processors, with an implementation that embeds processor-specific optimizations. Using a fine-grained global coordination strategy derived from the bulk-synchronous parallel (BSP) model, we have determined an accurate performance model that has guided the implementation and the optimization of our algorithm. Our experiments on a pre-production Cell/B.E. board running at 3.2 GHz show almost linear speedups when using multiple synergistic processing elements, and an impressive level of performance when compared to other processors. On graphs which offer sufficient parallelism, the Cell/B.E. is typically an order of magnitude faster than conventional processors, such as the AMD Opteron and the Intel Pentium 4 and Woodcrest, and custom-designed architectures, such as the MTA-2 and BlueGene/L.
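
The BSP-style coordination mentioned in the abstract can be pictured with a level-synchronous BFS, sketched below in plain Python. This is a generic textbook formulation and not the Cell/B.E. implementation; the function name and the sample graph are hypothetical. Each pass of the outer loop corresponds to one superstep: the whole current frontier is expanded, then the next frontier replaces it.

```python
def bfs_levels(adj, source):
    """Return {vertex: BFS level}, expanding one whole frontier per superstep."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # In a parallel version this loop would be split across processing elements.
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in level:  # first visit fixes the level
                    level[v] = depth
                    next_frontier.append(v)
        # Swapping frontiers plays the role of the barrier between supersteps.
        frontier = next_frontier
    return level

# Hypothetical five-vertex graph given as adjacency lists.
graph = {0: [1, 2], 1: [3], 2: [3], 3: [4], 4: []}
print(bfs_levels(graph, 0))
```

Because all vertices of one level are handled before any vertex of the next, the cost of each superstep can be modeled separately, which is what makes a BSP-derived performance model practical.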

Journal of Chemical Physics | 2010

Active-space completely-renormalized equation-of-motion coupled-cluster formalism: Excited-state studies of green fluorescent protein, free-base porphyrin, and oligoporphyrin dimer

Karol Kowalski; Sriram Krishnamoorthy; Oreste Villa; Jeffrey R. Hammond; Niranjan Govind

The completely renormalized equation-of-motion coupled-cluster approach with singles, doubles, and noniterative triples [CR-EOMCCSD(T)] has proven to be a reliable tool in describing vertical excitation energies in small and medium size molecules. In order to reduce the high numerical cost of the genuine CR-EOMCCSD(T) method and make noniterative CR-EOMCCSD(T) approaches applicable to large molecular systems, two active-space variants of this formalism [the CR-EOMCCSd(t)-II and CR-EOMCCSd(t)-III methods], based on two different choices of the subspace of triply excited configurations employed to construct the noniterative correction, are introduced. In calculations for green fluorescent protein (GFP) and free-base porphyrin (FBP), where the CR-EOMCCSD(T) results are available, we show good agreement between the active-space CR-EOMCCSD(T) (variant II) and full CR-EOMCCSD(T) excitation energies. For the oligoporphyrin dimer (P2TA), active-space CR-EOMCCSD(T) results provide reasonable agreement with experimentally inferred data. For all systems considered we demonstrated that the active-space CR-EOMCCSD(T) corrections lower the EOMCCSD (iterative equation-of-motion coupled-cluster method with singles and doubles) excitation energies by 0.2 and 0.3 eV, which leads to a better agreement with experiment. We also discuss the quality of basis sets used and compare EOMCC excitation energies with excitation energies obtained with other methods. In particular, we demonstrate that for GFP and FBP Sadlej's TZP and cc-pVTZ basis sets lead to a similar quality of the EOMCC results. The performance of the CR-EOMCCSD(T) implementation is discussed from the point of view of timings of iterative parts and scalability of the most expensive, N^7, part of the calculation. In the latter case the scalability across 34 008 processors is reported.

IEEE Computer | 2008

Accelerating Real-Time String Searching with Multicore Processors

Oreste Villa; Daniele Paolo Scarpazza; Fabrizio Petrini

String searching is at the core of tools used to search, filter, and protect data, but this has become increasingly difficult to do in real time as communication speed grows. The authors present an optimization strategy for a popular algorithm that fully exploits the IBM Cell Broadband Engine architecture to perform exact string matching against large dictionaries and also offer various solutions to alleviate memory congestion.

International Parallel and Distributed Processing Symposium | 2007

Peak-Performance DFA-based String Matching on the Cell Processor

Daniele Paolo Scarpazza; Oreste Villa; Fabrizio Petrini

The security of your data and of your network is in the hands of intrusion detection systems, virus scanners and spam filters, which are all critically based on string matching. But network links are getting faster and faster, and string matching is getting more and more difficult to perform in real time. Traditional processors are not keeping up with the performance demands, whereas specialized hardware will never be able to compete with commodity hardware in terms of cost effectiveness, reusability and ease of programming. Advanced multi-core architectures like the IBM Cell Broadband Engine promise unprecedented performance at a low cost, thanks to their popularity and production volume. Nevertheless, the suitability of the Cell processor to string matching has not been investigated so far. In this paper we investigate the performance attainable by the Cell processor when employed for string matching algorithms based on deterministic finite-state automata (DFA). Our findings show that the Cell is an ideal candidate to tackle modern security needs: two processing elements alone, out of the eight available on one Cell processor, provide sufficient computational power to filter a network link with bit rates in excess of 10 Gbps.
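
As a rough picture of what DFA-based matching involves, the sketch below builds a transition table for a single byte pattern (the classic Knuth-Morris-Pratt automaton) and scans input with one table lookup per byte. It is an illustrative CPU example only, not the Cell implementation from the paper; the function names are hypothetical and the pattern is assumed to be non-empty.

```python
def build_dfa(pattern: bytes):
    """Build dfa[state][byte] -> next state for one non-empty pattern."""
    m = len(pattern)
    dfa = [[0] * 256 for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    fallback = 0  # state the automaton would be in after reading pattern[1:state]
    for state in range(1, m + 1):
        dfa[state] = dfa[fallback][:]  # on mismatch, behave like the fallback state
        if state < m:
            dfa[state][pattern[state]] = state + 1  # on match, advance
            fallback = dfa[fallback][pattern[state]]
    return dfa

def count_matches(dfa, text: bytes):
    """Count (possibly overlapping) occurrences using one lookup per input byte."""
    accept = len(dfa) - 1
    state = 0
    hits = 0
    for byte in text:
        state = dfa[state][byte]
        if state == accept:
            hits += 1
    return hits

print(count_matches(build_dfa(b"aba"), b"ababa"))  # 2 overlapping occurrences
```

The scanning loop never backtracks over the input and does constant work per byte, which is the property that makes table-driven automata attractive for filtering traffic at a fixed line rate.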

IEEE Transactions on Very Large Scale Integration Systems | 2006

Efficient Synchronization for Embedded On-Chip Multiprocessors

Matteo Monchiero; Gianluca Palermo; Cristina Silvano; Oreste Villa

This paper investigates optimized synchronization techniques for shared memory on-chip multiprocessors (CMPs) based on network-on-chip (NoC) and targeted at future mobile systems. The proposed solution is based on the idea of locally performing synchronization operations that require continuous polling of a shared variable and thus feature high contention (e.g., spin locks and barriers). A hardware (HW) module, the synchronization-operation buffer (SB), has been introduced to queue and to manage the requests issued by the processors. By using this mechanism, we propose a spin lock implementation requiring a constant number of network transactions and memory accesses per lock acquisition. The SB also supports an efficient implementation of barriers. Experimental validation has been carried out by using GRAPES, a cycle-accurate performance/power simulation platform for multiprocessor systems-on-chip (MPSoCs). Two different architectures have been explored to prove that the proposed approach is effective independently of the caches and coherence schemes adopted. For an eight-processor target architecture, we show that the SB-based solution achieves up to 50% performance improvement and 30% energy saving with respect to synchronization based on the caching of the synchronization variables and directory-based coherence protocol. Furthermore, we prove the scalability of the proposed approach when the number of processors increases.

Journal of Systems Architecture | 2007

Exploration of distributed shared memory architectures for NoC-based multiprocessors

Matteo Monchiero; Gianluca Palermo; Cristina Silvano; Oreste Villa

Multiprocessor system-on-chip (MP-SoC) platforms represent an emerging trend for embedded multimedia applications. To enable MP-SoC platforms, scalable communication-centric interconnect fabrics, such as networks-on-chip (NoCs), have been recently proposed. The shared memory represents one of the key elements in designing MP-SoCs to provide data exchange and synchronization support. This paper focuses on the energy/delay exploration of a distributed shared memory architecture, suitable for low-power on-chip multiprocessors based on NoC. A mechanism is proposed for the data allocation on the distributed shared memory space, dynamically managed by an on-chip hardware memory management unit (HwMMU). Moreover, the exploitation of the HwMMU primitives for the migration, replication, and compaction of shared data is discussed. Experimental results show the impact of different distributed shared memory configurations for a selected set of parallel benchmark applications from the power/performance perspective. Furthermore, a case study for a graph exploration algorithm is discussed, accounting for the effects of the core mapping and the network topology on energy and performance at the system level.

Compilers, Architecture, and Synthesis for Embedded Systems | 2008

Efficiency and scalability of barrier synchronization on NoC based many-core architectures

Oreste Villa; Gianluca Palermo; Cristina Silvano

Interconnects based on Networks-on-Chip are an appealing solution to address future microprocessor designs where, very likely, hundreds of cores will be connected on a single chip. A fundamental role in highly parallelized applications running on many-core architectures will be played by barrier primitives used to synchronize the execution of parallel processes. This paper focuses on the analysis of the efficiency and scalability of different barrier implementations in many-core architectures based on NoCs. Several message passing barrier implementations based on four algorithms (all-to-all, master-slave, butterfly and tree) have been implemented and evaluated for a single-chip target architecture composed of a variable number of cores (from 4 to 128) and different network topologies (mesh, torus, ring, clustered-ring and fat-tree). Using a cycle-accurate simulator, we show the scalability of each barrier for every NoC topology, analyzing and comparing theoretical with real behaviors. We observed that some barrier algorithms, when implemented in hardware or software, show a scaling behavior different from the one theoretically expected. We evaluate the efficiency of each topology-barrier combination, demonstrating that, in many cases, simple network topologies can be more efficient than complex and highly connected topologies.
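
For readers unfamiliar with the barrier algorithms named in the abstract, the following sketch simulates the textbook butterfly barrier and contrasts its idealized message count with that of a master-slave barrier. It models only the theoretical behavior, with no network topology or timing, and is not taken from the paper; the function names are hypothetical and the core count is assumed to be a power of two.

```python
def butterfly_rounds(n):
    """Simulate a butterfly barrier on n cores; return (rounds, messages)."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    known = [{i} for i in range(n)]  # arrivals each core has heard about so far
    rounds = messages = 0
    step = 1
    while step < n:
        # In each round, core i exchanges its knowledge with core i XOR step.
        known = [known[i] | known[i ^ step] for i in range(n)]
        messages += n  # every core sends exactly one message per round
        rounds += 1
        step *= 2
    # After log2(n) rounds every core knows that all cores have arrived.
    assert all(len(k) == n for k in known)
    return rounds, messages

def master_slave_messages(n):
    """Idealized count: n - 1 arrival messages plus n - 1 release messages."""
    return 2 * (n - 1)

print(butterfly_rounds(8))       # (3, 24)
print(master_slave_messages(8))  # 14
```

The butterfly needs fewer sequential rounds but more messages in total, so how it behaves on a real interconnect depends on how those messages map onto the topology, which is the kind of gap between theoretical and measured scaling the abstract refers to.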

IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Scaling the power wall: a path to exascale

Oreste Villa; Daniel R. Johnson; Mike O'Connor; Evgeny Bolotin; David W. Nellans; Justin Luitjens; Nikolai Sakharnykh; Peng Wang; Paulius Micikevicius; Anthony Scudiero; Stephen W. Keckler; William J. Dally

Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020 and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signalling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.

Computing Frontiers | 2010

Efficient pattern matching on GPUs for intrusion detection systems

Antonino Tumeo; Oreste Villa; Donatella Sciuto

In this paper we present an efficient implementation of the Aho-Corasick pattern matching algorithm on Graphics Processing Units (GPUs), showing how we redesigned the algorithm and the data structures to fit the architecture and comparing it with an equivalent implementation on the CPU. We show that, with a synthetic dataset, our implementation obtains a speedup of up to 6.67× with respect to the CPU solution.
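
For reference, a compact sequential version of the Aho-Corasick algorithm is sketched below in Python. It is the standard textbook construction (a trie plus failure links built breadth-first), not the redesigned GPU data structure described in the paper; the function names and the sample dictionary are hypothetical.

```python
from collections import deque

def build_automaton(patterns):
    """Build the trie (goto), failure links (fail) and outputs (out)."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    # Breadth-first pass: the failure link of a node is the longest proper
    # suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit matches ending here
    return goto, fail, out

def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits

print(find_all("ushers", ["he", "she", "his", "hers"]))
```

All patterns are matched in a single pass over the text. A common way to parallelize this kind of scan, mentioned here only as a general idea, is to split the input into overlapping chunks and run the same automaton on each chunk.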
