
Publication


Featured research published by Davide Pasetto.


IEEE International Conference on High Performance Computing, Data and Analytics | 2010

Scalable Graph Exploration on Multicore Processors

Virat Agarwal; Fabrizio Petrini; Davide Pasetto; David A. Bader

Many important problems in computational sciences, social network analysis, security, and business analytics are data-intensive and lend themselves to graph-theoretical analyses. In this paper we investigate the challenges involved in exploring very large graphs by designing a breadth-first search (BFS) algorithm for advanced multi-core processors that are likely to become the building blocks of future exascale systems. Our new methodology for large-scale graph analytics combines a high-level algorithmic design that captures the machine-independent aspects, to guarantee performance portability to future processors, with an implementation that embeds processor-specific optimizations. We present an experimental study that uses state-of-the-art Intel Nehalem EP and EX processors and up to 64 threads in a single system. Our performance on several benchmark problems representative of the power-law graphs found in real-world problems reaches processing rates that are competitive with supercomputing results in the recent literature. In the experimental evaluation we show that our graph exploration algorithm running on a 4-socket Nehalem EX is (1) 2.4 times faster than a Cray XMT with 128 processors when exploring a random graph with 64 million vertices and 512 million edges, (2) capable of processing 550 million edges per second on an R-MAT graph with 200 million vertices and 1 billion edges, comparable to the performance of a similar graph on a Cray MTA-2 with 40 processors, and (3) 5 times faster than 256 BlueGene/L processors on a graph with average degree 50.
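
The paper's optimized BFS code is not shown here; as a reference for the underlying technique, a minimal level-synchronous parallel BFS over a CSR graph might look like the following sketch. The `offsets`/`neighbors` layout and function signature are illustrative assumptions, and a compare-and-swap ensures each vertex is claimed exactly once. Without OpenMP the pragmas are ignored and the code simply runs serially.

```cpp
// Level-synchronous BFS over a CSR graph. The neighbors of v occupy
// neighbors[offsets[v]] .. neighbors[offsets[v+1]-1]. Compile with
// -fopenmp; without it the pragmas are ignored and the code runs serially.
#include <cstdint>
#include <vector>

std::vector<int64_t> bfs(const std::vector<int64_t>& offsets,
                         const std::vector<int64_t>& neighbors,
                         int64_t source) {
    const int64_t n = (int64_t)offsets.size() - 1;
    std::vector<int64_t> level(n, -1);
    std::vector<int64_t> frontier{source};
    level[source] = 0;
    int64_t depth = 0;

    while (!frontier.empty()) {
        std::vector<int64_t> next;
        #pragma omp parallel
        {
            std::vector<int64_t> local;          // per-thread output buffer
            #pragma omp for nowait
            for (int64_t i = 0; i < (int64_t)frontier.size(); ++i) {
                int64_t v = frontier[i];
                for (int64_t e = offsets[v]; e < offsets[v + 1]; ++e) {
                    int64_t u = neighbors[e];
                    // CAS so exactly one thread claims each vertex.
                    if (__sync_bool_compare_and_swap(&level[u], (int64_t)-1,
                                                     depth + 1))
                        local.push_back(u);
                }
            }
            #pragma omp critical
            next.insert(next.end(), local.begin(), local.end());
        }
        frontier.swap(next);
        ++depth;
    }
    return level;                                // level[v] == -1: unreachable
}
```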


IEEE Computer | 2010

Tools for Very Fast Regular Expression Matching

Davide Pasetto; Fabrizio Petrini; Virat Agarwal

Regular expressions, or regex, are a common choice for defining configurable rules for data parsing because of their expressiveness in detecting recurrent patterns and information. For many data-intensive applications, regex matching is the first line of defense in performing online data filtering. Unfortunately, few solutions can keep up with the increasing data rates and the complexity posed by sets with hundreds of expressions. DotStar addresses this problem by providing a complete algorithmic solution and a software tool chain that can compile large sets of user-provided regex first into a sequence of intermediate representations and then into an automaton that can search for matches in a single pass without backtracking. The entire software tool chain supports the extended POSIX standard syntax for regex.
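
The DotStar tool chain itself is not reproduced here; the sketch below only illustrates the end product the abstract describes: a table-driven DFA that scans input in a single pass, one transition per byte, with no backtracking. The `Dfa` layout and `scan` signature are illustrative assumptions, and the compilation from regex to tables (the hard part) is elided.

```cpp
// Single-pass, table-driven DFA scanner. `delta` and `accept` are assumed
// to have been produced by a regex-to-DFA compilation step (the part a
// tool chain like DotStar automates).
#include <cstddef>
#include <cstdint>
#include <vector>

struct Dfa {
    std::vector<uint32_t> delta;   // delta[state * 256 + byte] -> next state
    std::vector<bool>     accept;  // accept[state]: some regex matched here
};

// One pass over the input, one table lookup per byte, no backtracking.
// Returns the offset just past the first match, or -1 if none.
long scan(const Dfa& dfa, const uint8_t* buf, size_t len) {
    uint32_t s = 0;                              // start state
    for (size_t i = 0; i < len; ++i) {
        s = dfa.delta[s * 256 + buf[i]];
        if (dfa.accept[s]) return (long)(i + 1);
    }
    return -1;
}
```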


Conference on Object-Oriented Programming, Systems, Languages, and Applications | 2011

A comparative study of parallel sort algorithms

Davide Pasetto; Albert Akhriev

In this paper we examine the performance of parallel sorting algorithms on modern multi-core hardware. Several general-purpose methods, with particular attention to sorting database records and very large arrays, are evaluated, and a brief analysis is provided.
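
The paper does not list its code; a minimal parallel merge sort of the general-purpose kind such a study compares might look like this sketch. The serial cutoff and recursion depth are arbitrary illustrative choices.

```cpp
// Parallel merge sort: recurse in parallel down to a serial cutoff, then
// merge. Example usage: psort(data, 0, data.size(), 3) uses up to 8 tasks.
#include <algorithm>
#include <future>
#include <vector>

void psort(std::vector<int>& v, size_t lo, size_t hi, int depth) {
    if (hi - lo < (size_t)1 << 16 || depth == 0) {   // small or deep: serial
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    auto left = std::async(std::launch::async,
                           [&] { psort(v, lo, mid, depth - 1); });
    psort(v, mid, hi, depth - 1);                    // right half on this thread
    left.wait();
    std::inplace_merge(v.begin() + lo, v.begin() + mid, v.begin() + hi);
}
```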


IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum | 2010

Streaming, low-latency communication in on-line trading systems

Hari Subramoni; Fabrizio Petrini; Virat Agarwal; Davide Pasetto

This paper presents and evaluates the performance of a prototype of an on-line OPRA data feed decoder. Our work demonstrates that, by using best-in-class commodity hardware, algorithmic innovations and careful design, it is possible to obtain the performance of custom-designed hardware solutions. Our prototype system integrates the latest Intel Nehalem processors and Myricom 10 Gigabit Ethernet technologies with an innovative algorithmic design based on the DotStar compilation tool. The resulting system can provide low latency, high bandwidth and the flexibility of commodity components in a single framework, with an end-to-end latency of less than four microseconds and an OPRA feed processing rate of almost 3 million messages per second per core, with a packet payload of only 256 bytes.
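
The prototype itself is not public; purely as an illustration of the careful-design theme, a busy-polling UDP receive loop of the kind latency-sensitive feed handlers use (spinning instead of sleeping on packet arrival) could look like the following. The port number and the elided decoding stage are hypothetical.

```cpp
// Busy-polling UDP receiver: MSG_DONTWAIT plus a spin loop trades a busy
// core for minimum wake-up latency. The decode step is elided.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(12345);            // hypothetical feed port
    bind(fd, (sockaddr*)&addr, sizeof addr);

    char pkt[2048];
    for (;;) {
        ssize_t n = recv(fd, pkt, sizeof pkt, MSG_DONTWAIT);
        if (n <= 0) continue;                // spin until a packet arrives
        // decode_opra(pkt, n);              // hypothetical parsing stage
    }
    close(fd);
}
```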


IEEE Computer Architecture Letters | 2010

Intra-Socket and Inter-Socket Communication in Multi-core Systems

Hari Subramoni; Fabrizio Petrini; Virat Agarwal; Davide Pasetto

The increasing computational and communication demands of the scientific and industrial communities require a clear understanding of the performance trade-offs involved in multi-core computing platforms. Such analysis can help application and toolkit developers design better, topology-aware communication primitives suited to the needs of various high-end computing applications. In this paper, we take on the challenge of designing and implementing a portable inter-core communication framework for streaming computing and evaluate its performance on some popular multi-core architectures developed by Intel, AMD and Sun. Our experimental results, obtained on the Intel Nehalem, AMD Opteron and Sun Niagara 2 platforms, show that we are able to achieve an intra-socket small-message latency between 120 and 271 nanoseconds, depending on the hardware platform, while the inter-socket small-message latency is between 218 and 320 nanoseconds. The maximum intra-socket communication bandwidth ranges from 0.179 (Sun Niagara 2) to 6.5 (Intel Nehalem) GBytes/s. We were also able to obtain an inter-socket communication performance of 1.2 and 6.6 GBytes/s for AMD Opteron and Intel Nehalem, respectively.
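
A sketch of the classic ping-pong microbenchmark behind latency numbers like these: two threads bounce a flag through a shared cache line, and the round-trip time halved approximates one-way latency. Pinning the threads to cores on the same or different sockets (e.g. via `pthread_setaffinity_np`, not shown) yields the intra- versus inter-socket comparison. This is an illustrative stand-in, not the paper's framework.

```cpp
// Ping-pong latency microbenchmark between two threads sharing an atomic.
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

std::atomic<int> flag{0};
constexpr int ROUNDS = 1000000;

int main() {
    std::thread peer([] {
        for (int i = 0; i < ROUNDS; ++i) {
            while (flag.load(std::memory_order_acquire) != 1) {}  // wait ping
            flag.store(0, std::memory_order_release);             // pong
        }
    });
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < ROUNDS; ++i) {
        flag.store(1, std::memory_order_release);                 // ping
        while (flag.load(std::memory_order_acquire) != 0) {}      // wait pong
    }
    auto t1 = std::chrono::steady_clock::now();
    peer.join();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("one-way latency ~ %.0f ns\n", ns / (2.0 * ROUNDS));
}
```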


High Performance Interconnects | 2009

Fulcrum's FocalPoint FM4000: A Scalable, Low-Latency 10GigE Switch for High-Performance Data Centers

Uri V. Cummings; Daniel P. Daly; Rebecca Collins; Virat Agarwal; Fabrizio Petrini; Michael P. Perrone; Davide Pasetto

The convergence of different types of networks onto a common data center infrastructure poses a superset of challenges for the underlying component technology. IP networks are feature-rich, storage networks are lossless with controlled topologies, and transaction networks are low-latency with low jitter and parallel multicast. A successful Converged Enhanced Ethernet (CEE) switch should pass the domain-specific network tests and demonstrate these disparate capabilities at the same time, while maintaining traffic separation. The FocalPoint FM4000 Ethernet switch chip was designed and architected both to provide a rich Ethernet feature set and to maintain the highest performance around corner cases. It achieves this through the use of a full-rate, shared-memory, parallel-multicasting switch architecture along with deeply pipelined frame processing. It implements traditional Ethernet, layer-3/4, and the new CEE features. In this paper, we provide an extensive performance evaluation of the FocalPoint FM4000 chip with a number of individual performance tests, including port-to-port line rate and latency, fairness of flow control under N-to-1 hot-spot, and multicast line rate and latency tests. Finally, we explore convergence by measuring the simultaneous performance of prioritized, flow-controlled unicast traffic and provisioned multicast traffic against the backdrop of full-rate best-effort stressing traffic. The experimental results show that the FocalPoint FM4000 switch provides an impressive flow-through latency of only 300 nanoseconds, which is insensitive to the packet size. The FM4000 delivers optimal performance under hot-spot communication with a degree of fairness above 98%, and provides an upper bound for latency in prioritized multicast ranging from 1.2 to 4.3 microseconds, depending on the average size of the background best-effort traffic. A direct comparison with non-prioritized multicast shows a performance speedup ranging from 29 to 38 times.


High Performance Distributed Computing | 2012

Performance evaluation of interthread communication mechanisms on multicore/multithreaded architectures

Davide Pasetto; Massimiliano Meneghin; Hubertus Franke; Fabrizio Petrini; Jimi Xenidis

The three major solutions for increasing the nominal performance of a CPU are multiplying the number of cores per socket, expanding the embedded cache memories, and using multi-threading to reduce the impact of the deep memory hierarchy. Systems with tens or hundreds of hardware threads, all sharing a cache-coherent UMA or NUMA memory space, are today the de-facto standard. While these solutions can easily provide benefits in a multi-program environment, they require recoding of applications to leverage the available parallelism. Threads must synchronize and exchange data, and the overall performance is heavily influenced by the overhead added by these mechanisms, especially as developers try to exploit finer-grain parallelism to use all available resources.
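
One concrete mechanism a study like this measures is the lock-free single-producer/single-consumer ring buffer, which lets two threads exchange data using only acquire/release atomics. A minimal sketch, illustrative rather than the paper's benchmark code:

```cpp
// Lock-free SPSC ring buffer. N must be a power of two so the index wraps
// with a cheap mask. Producer calls push(), consumer calls pop().
#include <atomic>
#include <cstddef>

template <typename T, size_t N>
class SpscQueue {
    T buf_[N];
    std::atomic<size_t> head_{0};                // advanced by consumer
    std::atomic<size_t> tail_{0};                // advanced by producer
public:
    bool push(const T& v) {
        size_t t = tail_.load(std::memory_order_relaxed);
        if (t - head_.load(std::memory_order_acquire) == N)
            return false;                        // full
        buf_[t & (N - 1)] = v;
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
    bool pop(T& out) {
        size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire))
            return false;                        // empty
        out = buf_[h & (N - 1)];
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
};
```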


Computer Science - Research and Development | 2010

DotStar: breaking the scalability and performance barriers in parsing regular expressions

Davide Pasetto; Fabrizio Petrini; Virat Agarwal

Regular expressions (shortened as regexp) are widely used to parse data and detect recurrent patterns and information, and are a common choice for defining configurable rules for a variety of systems. In fact, many data-intensive applications rely on regexp matching as the first line of defense to perform on-line data filtering. Unfortunately, few solutions can keep up with the increasing data rate and the complexity of sets containing hundreds of expressions. In this paper we present DotStar (.*), a complete algorithmic solution and a software tool-chain that can compile large sets of regexp into an automaton that takes advantage of the vector/SIMD extensions available on many commodity multi-core processors. DotStar relies on several algorithmic innovations to transform the user-provided regexp set into a sequence of manageable intermediate representations. The resulting automaton is both space- and time-efficient, and can search in a single pass without backtracking. The experimental evaluation, performed on a family of state-of-the-art processors, shows that DotStar can efficiently handle both small sets of regexp, used in protocol parsing, and larger sets designed for Network Intrusion Detection Systems (NIDS), achieving a performance between 1 and 5 Gbit/sec per core.
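
DotStar's use of SIMD is not spelled out beyond the abstract; as an assumption-labeled illustration of the general trick, engines of this kind often run a vectorized prefilter that scans 16 bytes at a time for a byte that can start any pattern, and only then enter the automaton. An SSE2 version:

```cpp
// SIMD prefilter: find the first occurrence of `first` in buf, 16 bytes
// per iteration. Requires SSE2 and a GCC/Clang builtin for the bit scan.
#include <emmintrin.h>   // SSE2 intrinsics
#include <cstddef>
#include <cstdint>

// Returns the index of the first occurrence of `first`, or len if absent.
size_t simd_find(const uint8_t* buf, size_t len, uint8_t first) {
    const __m128i needle = _mm_set1_epi8((char)first);
    size_t i = 0;
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i*)(buf + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask) return i + __builtin_ctz(mask);
    }
    for (; i < len; ++i)                     // scalar tail
        if (buf[i] == first) return i;
    return len;
}
```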


IEEE International Conference on High Performance Computing, Data and Analytics | 2009

SCAMPI: a scalable CAM-based algorithm for multiple pattern inspection

Fabrizio Petrini; Virat Agarwal; Davide Pasetto

String matching is one of the most compute-intensive steps in a network intrusion detection system. The growing network rates, rapidly approaching 10 Gbits/sec, and the large number of signatures that need to be scanned concurrently pose very demanding challenges to algorithmic design and practical implementation. In this paper we present SCAMPI, a ground-breaking string searching algorithm that is fast, space-efficient, scalable and resilient to attacks. SCAMPI is designed with a memory-centric model of complexity in mind, to minimize memory traffic and enhance data reuse with a careful compile-time data layout. The experimental evaluation, executed on two families of multicore processors, Cell B.E. and Intel Xeon E5472, shows that it is possible to obtain a processing rate of more than 2 Gbits/sec per core with very large dictionaries and heavy hitting rates. In the largest tested configuration, SCAMPI reaches 16 Gbits/sec on 8 Xeon cores, reaching, and in some cases exceeding, the performance of special-purpose processors and FPGAs. Using SCAMPI we have been able to scan an input stream using a dictionary of 3.5 million keywords, more than an order of magnitude larger than any published result in the literature and in commercial prototypes, at a rate of more than 1.2 Gbits/sec per processing core.
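
SCAMPI itself is not reproduced here; the sketch below only conveys the memory-centric shape of a multi-pattern scan: hash a small window at every offset into a compact, cache-resident table, and verify full keywords only on a hash hit. The 4-byte window (which assumes patterns of at least 4 bytes) and the data structures are illustrative choices, not SCAMPI's layout.

```cpp
// Multi-pattern scan: a 4-byte prefix hash filters candidates; full
// verification happens only on a hash hit, keeping the hot path to one
// probe per input offset. Assumes every pattern is at least 4 bytes long.
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

using Dict = std::unordered_map<uint32_t, std::vector<std::string>>;

Dict build(const std::vector<std::string>& patterns) {
    Dict d;
    for (const auto& p : patterns) {
        uint32_t h;
        std::memcpy(&h, p.data(), 4);         // key = first 4 bytes
        d[h].push_back(p);
    }
    return d;
}

size_t scan(const Dict& d, const char* buf, size_t len) {
    size_t hits = 0;
    for (size_t i = 0; i + 4 <= len; ++i) {
        uint32_t h;
        std::memcpy(&h, buf + i, 4);
        auto it = d.find(h);
        if (it == d.end()) continue;          // common case: no candidate
        for (const auto& p : it->second)      // rare case: verify in full
            if (i + p.size() <= len &&
                std::memcmp(buf + i, p.data(), p.size()) == 0)
                ++hits;
    }
    return hits;
}
```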


International Conference on Computer Communications | 2013

Low-latency and high bandwidth TCP/IP protocol processing through an integrated HW/SW approach

Ken Inoue; Davide Pasetto; Karol Lynch; Massimiliano Meneghin; Kay Muller; John Sheehan

Ultra low-latency networking is critical in many domains, such as high frequency trading and high performance computing (HPC), and highly desirable in many others, such as VoIP and on-line gaming. In closed systems - such as those found in HPC - InfiniBand, iWARP or RoCE are common choices, as system architects have the opportunity to choose the best host configurations and networking fabric. However, the vast majority of networks are built upon Ethernet, with nodes exchanging data using the standard TCP/IP stack. On such networks, achieving ultra low latency while maintaining compatibility with a standard TCP/IP stack is crucial. To date, most efforts for low-latency packet transfers have focused on three main areas: (i) avoiding context switches, (ii) avoiding buffer copies, and (iii) off-loading protocol processing. This paper describes IBM PowerEN and its networking stack, showing that an integrated system design which treats Ethernet adapters as first-class citizens that share the system bus with CPUs and memory, rather than as peripheral PCI Express attached devices, is a winning solution for achieving minimal latency. The work presents outstanding performance figures, including 1.30μs from wire to wire for UDP, usually the chosen protocol for latency-sensitive applications, and excellent latency and bandwidth figures for the more complex TCP.
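
PowerEN's advantage comes from integrated hardware, which software alone cannot replicate; for context, the software-visible knobs pursuing the same goal on a stock Linux TCP host look like this sketch. TCP_NODELAY disables Nagle batching, and SO_BUSY_POLL (where the kernel supports it) spins briefly instead of sleeping on packet arrival; the 50 microsecond budget is an arbitrary example value.

```cpp
// Configure a TCP socket for low latency on Linux: disable Nagle batching
// and, if available, enable kernel busy-polling on receive.
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

int make_low_latency(int fd) {
    int one = 1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof one) < 0)
        return -1;                            // send segments immediately
#ifdef SO_BUSY_POLL
    int usec = 50;                            // spin up to 50 us per receive
    setsockopt(fd, SOL_SOCKET, SO_BUSY_POLL, &usec, sizeof usec);
#endif
    return 0;
}
```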
