Michela Becchi | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Michela Becchi is active.

Explore More

Publication

Featured researches published by Michela Becchi.

computing frontiers | 2006

Dynamic thread assignment on heterogeneous multiprocessor architectures

Michela Becchi; Patrick Crowley

In a multi-programmed computing environment, threads of execution exhibit different runtime characteristics and hardware resource requirements. Not only do the behaviors of distinct threads differ, but each thread may also present diversity in its performance and resource usage over time. A heterogeneous chip multiprocessor (CMP) architecture consists of processor cores and caches of varying size and complexity. Prior work has shown that heterogeneous CMPs can meet the needs of a multi-programmed computing environment better than a homogeneous CMP system. In fact, the use of a combination of cores with different caches and instruction issue widths better accommodates threads with different computational requirements.A central issue in the design and use of heterogeneous systems is to determine an assignment of tasks to processors which better exploits the hardware resources in order to improve performance. In this paper we argue that the benefits of heterogeneous CMPs are bolstered by the usage of a dynamic assignment policy, i.e., a runtime mechanism which observes the behavior of the running threads and exploits thread migration between the cores. We validate our analysis by means of simulation. Specifically, our model assumes a combination of Alpha EV5 and Alpha EV6 processors and of integer and floating point programs from the SPEC2000 benchmark suite. We show that a dynamic assignment can outperform a static one by 20% to 40% on average and by as much as 80% in extreme cases, depending on the degree of multithreading simulated.

conference on emerging network experiment and technology | 2007

A hybrid finite automaton for practical deep packet inspection

Michela Becchi; Patrick Crowley

Deterministic finite automata (DFAs) are widely used to perform regular expression matching in linear time. Several techniques have been proposed to compress DFAs in order to reduce memory requirements. Unfortunately, many real-world IDS regular expressions include complex terms that result in an exponential increase in number of DFA states. Since all recent proposals use an initial DFA as a starting-point, they cannot be used as comprehensive regular expression representations in an IDS. In this work we propose a hybrid automaton which addresses this issue by combining the benefits of deterministic and non-deterministic finite automata. We test our proposal on Snort rule-sets and we validate it on real traffic traces. Finally, we address and analyze the worst case behavior of our scheme and compare it to traditional ones.

architectures for networking and communications systems | 2007

An improved algorithm to accelerate regular expression evaluation

Michela Becchi; Patrick Crowley

Modern network intrusion detection systems need to perform regular expression matching at line rate in order to detect the occurrence of critical patterns in packet payloads. While deterministic finite automata (DFAs) allow this operation to be performed in linear time, they may exhibit prohibitive memory requirements. In [9], Kumar et al. propose Delayed Input DFAs (D2FAs), which provide a trade-off between the memory requirements of the compressed DFA and the number of states visited for each character processed, which corresponds directly to the memory bandwidth required to evaluate regular expressions. In this paper we introduce a general compression technique that results in at most 2N state traversals when processing a string of length N. In comparison to the D2FA approach, our technique achieves comparable levels of compression, with lower provable bounds on memory bandwidth (or greater compression for a given bandwidth bound). Moreover, our proposed algorithm has lower complexity, is suitable for scenarios where a compressed DFA needs to be dynamically built or updated, and fosters locality in the traversal process. Finally, we also describe a novel alphabet reduction scheme for DFA-based structures that can yield further dramatic reductions in data structure size.

architectures for networking and communications systems | 2008

Efficient regular expression evaluation: theory to practice

Michela Becchi; Patrick Crowley

Several algorithms and techniques have been proposed recently to accelerate regular expression matching and enable deep packet inspection at line rate. This work aims to provide a comprehensive practical evaluation of existing techniques, extending them and analyzing their compatibility. The study focuses on two hardware architectures: memory-based ASICs and FPGAs.

high performance distributed computing | 2011

Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework

Vignesh T. Ravi; Michela Becchi; Gagan Agrawal; Srimat T. Chakradhar

Driven by the emergence of GPUs as a major player in high performance computing and the rapidly growing popularity of cloud environments, GPU instances are now being offered by cloud providers. The use of GPUs in a cloud environment, however, is still at initial stages, and the challenge of making GPU a true shared resource in the cloud has not yet been addressed. This paper presents a framework to enable applications executing within virtual machines to transparently share one or more GPUs. Our contributions are twofold: we extend an open source GPU virtualization software to include efficient GPU sharing, and we propose solutions to the conceptual problem of GPU kernel consolidation. In particular, we introduce a method for computing the affinity score between two or more kernels, which provides an indication of potential performance improvements upon kernel consolidation. In addition, we explore molding as a means to achieve efficient GPU sharing also in the case of kernels with high or conflicting resource requirements. We use these concepts to develop an algorithm to efficiently map a set of kernels on a pair of GPUs. We extensively evaluate our framework using eight popular GPU kernels and two Fermi GPUs. We find that even when contention is high our consolidation algorithm is effective in improving the throughput, and that the runtime overhead of our framework is low.

ieee international symposium on workload characterization | 2008

A workload for evaluating deep packet inspection architectures

Michela Becchi; Mark A. Franklin; Patrick Crowley

High-speed content inspection of network traffic is an important new application area for programmable networking systems, and has recently led to several proposals for high-performance regular expression matching. At the same time, the number and complexity of the patterns present in well-known network intrusion detection systems has been rapidly increasing. This increase is important since both the practicality and the performance of specific pattern matching designs are strictly dependent upon characteristics of the underlying regular expression set. However, a commonly agreed upon workload for the evaluation of deep packet inspection architectures is still missing, leading to frequent unfair comparisons, and to designs lacking in generality or scalability. In this paper, we propose a workload for the evaluation of regular expression matching architectures. The workload includes a regular expression model and a traffic generator, with the former characterizing different levels of expressiveness within rule-sets and the latter characterizing varying degrees of malicious network activity. The proposed workload is used here to evaluate designs (e.g., different memory layouts and hardware organizations) where the matching algorithm is based on compressed deterministic and non deterministic finite automata (DFAs and NFAs).

conference on emerging network experiment and technology | 2008

Extending finite automata to efficiently match Perl-compatible regular expressions

Michela Becchi; Patrick Crowley

Regular expression matching is a crucial task in several networking applications. Current implementations are based on one of two types of finite state machines. Non-deterministic finite automata (NFAs) have minimal storage demand but have high memory bandwidth requirements. Deterministic finite automata (DFAs) exhibit low and deterministic memory bandwidth requirements at the cost of increased memory space. It has already been shown how the presence of wildcards and repetitions of large character classes can render DFAs and NFAs impractical. Additionally, recent security-oriented rule-sets include patterns with advanced features, namely back-references, which add to the expressive power of traditional regular expressions and cannot therefore be supported through classical finite automata. In this work, we propose and evaluate an extended finite automaton designed to address these shortcomings. First, the automaton provides an alternative approach to handle character repetitions that limits memory space and bandwidth requirements. Second, it supports back-references without the need for back-tracking in the input string. In our discussion of this proposal, we address practical implementation issues and evaluate the automaton on real-world rule-sets. To our knowledge, this is the first high-speed automaton that can accommodate all the Perl-compatible regular expressions present in the Snort network intrusion and detection system.

architectures for networking and communications systems | 2006

CAMP: fast and efficient IP lookup architecture

Sailesh Kumar; Michela Becchi; Patrick Crowley; Jonathan S. Turner

A large body of research literature has focused on improving the performance of longest prefix match IP-lookup. More recently, embedded memory based architectures have been proposed, which delivers very high lookup and update throughput. These architectures often use a pipeline of embedded memories, where each stage stores a single or set of levels of the lookup trie. A stream of lookup requests are issued into the pipeline, one every cycle, in order to achieve high throughput. Most recently, Baboescu et al. [21] have proposed a novel architecture, which uses circular memory pipeline and dynamically maps parts of the lookup trie to different stages.In this paper we extend this approach with an architecture called Circular, Adaptive and Monotonic Pipeline (CAMP), which is based upon the key observation that circular pipeline allows decoupling the number of pipeline stages from the number of levels in the trie. This provides much more flexibility in mapping nodes of the lookup trie to the stages. The flexibility, in turn, improves the memory utilization and also reduces the total memory and power consumption. The flexibility comes at a cost however; since the requests are issued at an arbitrary stage, they may get blocked if their entry stage is busy. In an extreme case, a request may block for a time equal to the pipeline depth, which may severely affect the pipeline utilization. We show that fairly straightforward techniques can ensure nearly full utilization of the pipeline. These techniques, coupled with an adaptive mapping of trie nodes to the circular pipeline, create a pipelined architecture which can operate at high rates irrespective of the trie size.

architectures for networking and communications systems | 2009

Evaluating regular expression matching engines on network and general purpose processors

Michela Becchi; Charlie Wiseman; Patrick Crowley

In recent years we have witnessed a proliferation of data structure and algorithm proposals for efficient deep packet inspection on memory based architectures. In parallel, we have observed an increasing interest in network processors as target architectures for high performance networking applications. In this paper we explore design alternatives in the implementation of regular expression matching architectures on network processors (NPs) and general purpose processors (GPPs). Specifically, we present a performance evaluation on an Intel IXP2800 NP, on an Intel Xeon GPP and on a multiprocessor system consisting of four AMD Opteron 850 cores. Our study shows how to exploit the Intel IXP2800 architectural features in order to maximize system throughput, identifies and evaluates algorithmic and architectural trade-offs and limitations, and highlights how the presence of caches affects the overall performances. We provide an implementation of our NP designs within the Open Network Laboratory (http://www.onl.wustl.edu).

high performance distributed computing | 2012

A virtual memory based runtime to support multi-tenancy in clusters with GPUs

Michela Becchi; Kittisak Sajjapongse; Ian Graves; Adam M. Procter; Vignesh T. Ravi; Srimat T. Chakradhar

Graphics Processing Units (GPUs) are increasingly becoming part of HPC clusters. Nevertheless, cloud computing services and resource management frameworks targeting heterogeneous clusters including GPUs are still in their infancy. Further, GPU software stacks (e.g., CUDA driver and runtime) currently provide very limited support to concurrency. In this paper, we propose a runtime system that provides abstraction and sharing of GPUs, while allowing isolation of concurrent applications. A central component of our runtime is a memory manager that provides a virtual memory abstraction to the applications. Our runtime is flexible in terms of scheduling policies, and allows dynamic (as opposed to programmer-defined) binding of applications to GPUs. In addition, our framework supports dynamic load balancing, dynamic upgrade and downgrade of GPUs, and is resilient to their failures. Our runtime can be deployed in combination with VM-based cloud computing services to allow virtualization of heterogeneous clusters, or in combination with HPC cluster resource managers to form an integrated resource management infrastructure for heterogeneous clusters. Experiments conducted on a three-node cluster show that our GPU sharing scheme allows up to a 28% and a 50% performance improvement over serialized execution on short- and long-running jobs, respectively. Further, dynamic inter-node load balancing leads to an additional 18-20% performance benefit.

Explore More