Praveen Krishnamurthy

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Praveen Krishnamurthy is active.

Explore More

Publication

Featured researches published by Praveen Krishnamurthy.

high performance interconnects | 2003

Deep packet inspection using parallel Bloom filters

Sarang Dharmapurikar; Praveen Krishnamurthy; Todd S. Sproull; John W. Lockwood

Recent advances in network packet processing focus on payload inspection for applications that include content-based billing, layer-7 switching and Internet security. Most of the applications in this family need to search for predefined signatures in the packet payload. Hence an important building block of these processors is string matching infrastructure. Since conventional software-based algorithms for string matching have not kept pace with high network speeds, specialized high-speed, hardware-based solutions are needed. We describe a technique based on Bloom filters for detecting predefined signatures (a string of bytes) in the packet payload. A Bloom filter is a data structure for representing a set of strings in order to support membership queries. We use hardware Bloom filters to isolate all packets that potentially contain predefined signatures. Another independent process eliminates false positives produced by Bloom filters. We outline our approach for string matching at line speeds and present a performance analysis. Finally, we report the results for a prototype implementation of this system on the FPX platform. Our analysis shows that with the state-of-the-art FPGAs, a set of 10,000 strings can be scanned in the network data at the line speed of OC-48 (2.4 Gbps).

international symposium on microarchitecture | 2004

Deep packet inspection using parallel bloom filters

Sarang Dharmapurikar; Praveen Krishnamurthy; Todd S. Sproull; John W. Lockwood

There is a class of packet processing applications that inspect packets deeper than the protocol headers to analyze content. For instance, network security applications must drop packets containing certain malicious Internet worms or computer viruses carried in a packet payload. Content forwarding applications look at the hypertext transport protocol headers and distribute the requests among the servers for load balancing. Packet inspection applications, when deployed at router ports, must operate at wire speeds. With networking speeds doubling every year, it is becoming increasingly difficult for software-based packet monitors to keep up with the line rates. We describe a hardware-based technique using Bloom filters, which can detect strings in streaming data without degrading network throughput. A Bloom filter is a data structure that stores a set of signatures compactly by computing multiple hash functions on each member of the set. This technique queries a database of strings to check for the membership of a particular string. The answer to this query can be false positive but never a false negative. An important property of this data structure is that the computation time involved in performing the query is independent of the number of strings in the database provided the memory used by the data structure scales linearly with the number of strings stored in it. Furthermore, the amount of storage required by the Bloom filter for each string is independent of its length.

IEEE ACM Transactions on Networking | 2006

Longest prefix matching using bloom filters

Sarang Dharmapurikar; Praveen Krishnamurthy; David E. Taylor

We introduce the first algorithm that we are aware of to employ Bloom filters for longest prefix matching (LPM). The algorithm performs parallel queries on Bloom filters, an efficient data structure for membership queries, in order to determine address prefix membership in sets of prefixes sorted by prefix length. We show that use of this algorithm for Internet Protocol (IP) routing lookups results in a search engine providing better performance and scalability than TCAM-based approaches. The key feature of our technique is that the performance, as determined by the number of dependent memory accesses per lookup, can be held constant for longer address lengths or additional unique address prefix lengths in the forwarding table given that memory resources scale linearly with the number of prefixes in the forwarding table. Our approach is equally attractive for Internet Protocol Version 6 (IPv6) which uses 128-bit destination addresses, four times longer than IPv4. We present a basic version of our approach along with optimizations leveraging previous advances in LPM algorithms. We also report results of performance simulations of our system using snapshots of IPv4 BGP tables and extend the results to IPv6. Using less than 2 Mb of embedded RAM and a commodity SRAM device, our technique achieves average performance of one hash probe per lookup and a worst case of two hash probes and one array access per lookup.

application-specific systems, architectures, and processors | 2004

Biosequence similarity search on the Mercury system

Praveen Krishnamurthy; Jeremy Buhler; Roger D. Chamberlain; Mark A. Franklin; M. Gyang; Joseph M. Lancaster

Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity search tool for biosequences is BLAST, a program designed to compare query sequences to a database. Here, we present the design of BLASTN, the version of BLAST that searches DNA sequences, on the Mercury system, an architecture that supports high-volume, high-throughput data movement off a data store and into reconfigurable hardware. An important component of application deployment on the Mercury system is the functional decomposition of the application onto both the reconfigurable hardware and the traditional processor. Both the Mercury BLASTN application design and its performance analysis are described.

annual simulation symposium | 2003

Dynamic reconfiguration of an optical interconnect

Praveen Krishnamurthy; Mark A. Franklin; Roger D. Chamberlain

The advent of optical technology that can feasibly support extremely high bandwidth chip-to-chip communication raises a host of architectural questions in the design of digital systems. Terabit per second (and higher) bandwidths have not previously been available at the chip level. In this paper we examine the use of this optical technology in the design of a network router, a system for which previous designs have been I/O-limited at the chip level. Specifically, we examine the benefits of bandwidth reconfigurability, a capability of the proposed switch fabric, and illustrate the performance impact associated with different reconfiguration rates.

International Journal of Parallel Programming | 2005

Extracting and improving microarchitecture performance on reconfigurable architectures

Shobana Padmanabhan; Phillip H. Jones; David V. Schuehler; Scott J. Friedman; Praveen Krishnamurthy; Huakai Zhang; Roger D. Chamberlain; Ron K. Cytron; Jason E. Fritts; John W. Lockwood

Applications for constrained embedded systems require careful attention to the match between the application and the support offered by an architecture, at the ISA and microarchitecture levels. Generic processors, such as ARM and Power PC, are inexpensive, but with respect to a given application, they often overprovision in areas that are unimportant for the application’s performance. Moreover, while application-specific, customized logic could dramatically improve the performance of an application, that approach is typically too expensive to justify its cost for most applications. In this paper, we describe our experience using reconfigurable architectures to develop an understanding of an application’s performance and to enhance its performance with respect to customized, constrained logic. We begin with a standard ISA currently in use for embedded systems. We modify its core to measure performance characteristics, obtaining a system that provides cycle-accurate timings and presents results in the style of gprof, but with absolutely no software overhead. We then provide cache-behavior statistics that are typically unavailable in a generic processor. In contrast with simulation, our approach executes the program at full speed and delivers statistics based on the actual behavior of the cache subsystem. Finally, in response to the performance profile developed on our platform, we evaluate various uses of the FPGA-realized instruction and data caches in terms of the application’s performance.

memory performance dealing with applications systems and architecture | 2006

Dusty caches for reference counting garbage collection

Scott J. Friedman; Praveen Krishnamurthy; Roger D. Chamberlain; Ron K. Cytron; Jason E. Fritts

Reference counting is a garbage-collection technique that maintains a per-object count of the number of pointers to that object. When the count reaches zero, the object must be dead and can be collected. Although it is cannot detect all garbage on its own, it is well suited for some applications and is implemented typically in conjunction with other methods to increase overall precision. A disadvantage of reference counting is the extra storage traffic that is introduced. In this paper, we describe a new cache write-back policy that can substantially decrease the reference-counting traffic to RAM.We investigate a cache design that takes advantage of temporally silent stores, by remebering the first-fetched value of a cache subblock, so that the subblock need not be written back to RAM unless a different value is present. We present results from experiments that show the effectiveness of this approach, particularly in mitigating the storage traffic due to reference counting.

annual simulation symposium | 2002

Evaluating the performance of photonic interconnection networks

Roger D. Chamberlain; Ch'ng Shi Baw; Mark A. Franklin; Christopher Hackmann; Praveen Krishnamurthy; Abhijit Mahajan; Michael Wrighton

This paper describes the design and use of the interconnection network simulator (ICNS) framework. ICNS is a modular, object-oriented simulation system that has been developed to investigate performance issues in multiprocessor interconnection networks that exploit photonic technology in their design. We describe the ICNS infrastructure, present two distinct photonic interconnection networks that have been modeled using ICNS, and give performance results for each of these networks.

application-specific systems, architectures, and processors | 2002

Optical network reconfiguration for signal processing applications

Roger D. Chamberlain; Mark A. Franklin; Praveen Krishnamurthy

This paper considers a class of embedded signal processing applications. To achieve real-time performance these applications must be executed on a parallel processor. The paper focuses on the multiring optical interconnection network used in the system and specifically on the performance gains associated with utilizing the bandwidth reconfiguration capabilities associated with the network. The network is capable of being reconfigured to provide designated bandwidths to different source-destination connections both across rings and within a ring. The applications each consist of a sequence of alternating communication and computation phases. The sequence continues until execution of the application is complete. The effect of reconfiguration on application performance is explored using simulation techniques. The results indicate that substantial performance gains (speedups of 2 or more) can be achieved for this application class.

international parallel and distributed processing symposium | 2008

Analytic performance models for bounded queueing systems

Praveen Krishnamurthy; Roger D. Chamberlain

Pipelined computing applications often have their performance modeled using queueing techniques. While networks with infinite capacity queues have well understood properties, networks with finite capacity queues and blocking between servers have resisted closed-form solutions and are typically analyzed with approximate solutions. It is this latter case that more closely represents the circumstances present for pipelined computation. In this paper, we extend an existing approximate solution technique and, more importantly, provide guidance as to when the approximate solutions work well and when they fail.

Explore More