Is this you? Create Your Porfile

Antonino Tumeo

Pacific Northwest National Laboratory

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Antonino Tumeo is active.

Explore More

Publication

Featured researches published by Antonino Tumeo.

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems | 2010

Ant Colony Heuristic for Mapping and Scheduling Tasks and Communications on Heterogeneous Embedded Systems

Fabrizio Ferrandi; Pier Luca Lanzi; Christian Pilato; Donatella Sciuto; Antonino Tumeo

To exploit the power of modern heterogeneous multiprocessor embedded platforms on partitioned applications, the designer usually needs to efficiently map and schedule all the tasks and the communications of the application, respecting the constraints imposed by the target architecture. Since the problem is heavily constrained, common methods used to explore such design space usually fail, obtaining low-quality solutions. In this paper, we propose an ant colony optimization (ACO) heuristic that, given a model of the target architecture and the application, efficiently executes both scheduling and mapping to optimize the application performance. We compare our approach with several other heuristics, including simulated annealing, tabu search, and genetic algorithms, on the performance to reach the optimum value and on the potential to explore the design space. We show that our approach obtains better results than other heuristics by at least 16% on average, despite an overhead in execution time. Finally, we validate the approach by scheduling and mapping a JPEG encoder on a realistic target architecture.

computing frontiers | 2010

Efficient pattern matching on GPUs for intrusion detection systems

Antonino Tumeo; Oreste Villa; Donatella Sciuto

In this paper we present an efficient implementation of the Aho-Corasick pattern matching algorithm on Graphics Processing Units (GPU), showing how we redesigned the algorithm and the data structures to fit on the architecture and comparing it with an equivalent implementation on the CPU. We show that with a synthetic dataset, our implementation obtains a speedup up to 6.67 with respect to the CPU solution.

ieee computer society annual symposium on vlsi | 2007

A Pipelined Fast 2D-DCT Accelerator for FPGA-based SoCs

Antonino Tumeo; Matteo Monchiero; Gianluca Palermo; Fabrizio Ferrandi; Donatella Sciuto

Multimedia applications, and in particular the encoding and decoding of standard image and video formats, are usually a typical target for systems-on-chip (SoC). The bi-dimensional discrete cosine transformation (2D-DCT) is a commonly used frequency transformation in graphic compression algorithms. Many hardware implementations, adopting disparate algorithms, have been proposed for field programmable gate arrays (FPGA). These designs focus either on performance or area, and often do not succeed in balancing the two aspects. In this paper, we present a design of a fast 2D-DCT hardware accelerator for a FPGA-based SoC. This accelerator makes use of a single seven stages 1D-DCT pipeline able to alternate computation for the even and odd coefficients in every cycle. In addition, it uses special memories to perform the transpose operations. Our hardware takes 80 clock cycles at 107MHz to generate a complete 8times8 2D DCT, from the writing of the first input sample to the reading of the last result (including the overhead of the interface logic). We show that this architecture provides optimal performance/area ratio with respect to several alternative designs.

international conference on embedded computer systems: architectures, modeling, and simulation | 2008

Ant colony optimization for mapping and scheduling in heterogeneous multiprocessor systems

Antonino Tumeo; Christian Pilato; Fabrizio Ferrandi; Donatella Sciuto; Pier Luca Lanzi

Heterogeneous multiprocessor systems, assembled with off-the-shelf processors and augmented with reprogrammable devices, thanks to their performance, cost effectiveness and flexibility, have become a standard platform for embedded systems. To fully exploit the computational power offered by these systems, great care should be taken when deciding on which processing element (mapping) and when (scheduling) executing the program tasks. Unfortunately, both these problems are NP-complete, and, even if they are strictly interconnected, they are normally performed separately with exact or heuristic algorithms to simplify the search for the optimum points. In this paper we present an exploration algorithm based on Ant Colony Optimization (ACO) that tries to solve the two problems simultaneously. We propose an implementation of the algorithm that gradually constructs feasible solution instances and searches around them rather than exploring a structure that already considers all the possible solutions. We introduce a two-stage decision mechanism that simplifies the data structures but lets the ant perform correlated choices for both the mapping and the scheduling. We show that this algorithm provides better and more robust solutions in less time than the Simulated Annealing and the Tabu Search algorithms, extended to support the combined scheduling and mapping problems. In particular, our ACO formulation can find, on average, solutions between 64% and 55% better than Simulated Annealing and Tabu Search.

great lakes symposium on vlsi | 2007

A design kit for a fully working shared memory multiprocessor on FPGA

Antonino Tumeo; Matteo Monchiero; Gianluca Palermo; Fabrizio Ferrandi; Donatella Sciuto

This paper presents a framework to design a shared memory multiprocessor on a programmable platform. We propose a complete flow, composed by a programming model and a template architecture. Our framework permits to write a parallel application by using a shared memory model. It deals with the consistency of shared data, with no need of hardware coherence protocol, but uses a software model to properlyallsynchronize the local copies with the shared memory image. This idea can be applied both to a scratchpad-based architecture or a cache-based one. The architecture is synthesizable with standard IPs, such as the softcores and interconnect elements, which may be found in any commercial FPGA toolset.

symposium on application specific processors | 2010

Accelerating DNA analysis applications on GPU clusters

Antonino Tumeo; Oreste Villa

DNA analysis is an emerging application of high performance bioinformatics. Modern sequencing machinery are able to provide, in few hours, large input streams of data which needs to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and fastly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high performance clusters accelerated with Graphic Processing Units (GPUs). We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present a MPI based implementation for a heterogeneous high performance cluster. We compare this implementation to MPI and MPI with pthreads based implementations for a homogeneous cluster of x86 processors, discussing the stability vs. the performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.

asia and south pacific design automation conference | 2010

Mapping and scheduling of parallel C applications with ant colony optimization onto heterogeneous reconfigurable MPSoCs

Fabrizio Ferrandi; Christian Pilato; Donatella Sciuto; Antonino Tumeo

Efficient mapping and scheduling of partitioned applications are crucial to improve the performance on todays reconfigurable multiprocessor systems-on-chip (MPSoCs) platforms. Most of existing heuristics adopt the Directed Acyclic (task) Graph as representation, that unfortunately, is not able to represent typical embedded applications (e.g., real-time and loop-partitioned). In this paper we propose a novel approach, based on Ant Colony Optimization, that explores different alternative designs to determine an efficient hardware-software partitioning, to decide the task allocation and to establish the execution order of the tasks, dealing with different design constraints imposed by a reconfigurable heterogeneous MPSoC. Moreover, it can be applied to any parallel C application, represented through Hierarchical Task Graphs. We show that our methodology, addressing a realistic target architecture, outperforms existing approaches on a representative set of embedded applications.

IEEE Transactions on Parallel and Distributed Systems | 2012

Aho-Corasick String Matching on Shared and Distributed-Memory Parallel Architectures

Antonino Tumeo; Oreste Villa; Daniel G. Chavarría-Miranda

String matching requires a combination of (sometimes all) the following characteristics: high and/or predictable performance, support for large data sets and flexibility of integration and customization. This paper compares several software-based implementations of the Aho-Corasick algorithm for high-performance systems. We focus on the matching of unknown inputs streamed from a single source, typical of security applications and difficult to manage since the input cannot be preprocessed to obtain locality. We consider shared-memory architectures (Niagara 2, x86 multiprocessors, and Cray XMT) and distributed-memory architectures with homogeneous (InfiniBand cluster of x86 multicores) or heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C1060 GPUs). We describe how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.

international conference on embedded computer systems: architectures, modeling, and simulation | 2007

An Interrupt Controller for FPGA-based Multiprocessors

Antonino Tumeo; Marco Branca; Lorenzo Camerini; Matteo Monchiero; Gianluca Palermo; Fabrizio Ferrandi; Donatella Sciuto

Interrupt-based programming is widely used for interfacing a processor with peripherals and allowing software threads to interact. Many hardware/software architectures have been proposed in the past to support this kind of programming practice. In the context of FPGA-based multiprocessors this topic has not been thoroughly faced yet. This paper presents the architecture of an interrupt controller for a FPGA-based multiprocessor composed of standard off-of-the-shelf softcores. The main feature of this device is to distribute multiple interrupts across the cores of a multiprocessor. In addition, our architecture supports several advanced features like booking, broadcasting and inter-processor interrupt. On the top of this hardware layer, we provide a software library to effectively exploit this mechanism. We realized a prototype of this system. Our experiments show that our interrupt controller efficiently distributes multiple interrupts on the system.

design, automation, and test in europe | 2008

A dual-priority real-time multiprocessor system on FPGA for automotive applications

Antonino Tumeo; Marco Branca; Lorenzo Camerini; Marco Ceriani; Gianluca Palermo; Fabrizio Ferrandi; Donatella Sciuto; Matteo Monchiero

This paper presents the implementation of a dual-priority scheduling algorithm for real-time embedded systems on a shared memory multiprocessor on FPGA. The dual-priority microkernel is supported by a multiprocessor interrupt controller to trigger periodic and aperiodic thread activation and manage context switching. We show how the dual-priority algorithm performs on a real system prototype compared to the theoretical performance simulations with a typical standard workload of automotive applications, underlining where the differences are.

Explore More