Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Laxmi N. Bhuyan is active.

Publication


Featured researches published by Laxmi N. Bhuyan.


IEEE Computer | 1989

Performance of multiprocessor interconnection networks

Laxmi N. Bhuyan; Qing Yang; Dharma P. Agrawal

A tutorial is provided on the performance evaluation of multiprocessor interconnection networks, to guide system designers in their design process. A classification of parallel/distributed systems is followed by a classification of multiprocessor interconnection networks. Basic terminology for performance evaluation is presented. The performance of crossbar interconnection networks, multistage interconnection networks, and multiple-bus systems is then addressed, and a comparison is made among them.
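The bandwidth analysis surveyed in such tutorials typically rests on a standard closed-form approximation for a crossbar under uniform, independent memory references. A minimal sketch (the function name is ours, not from the paper):

```python
def crossbar_bandwidth(n_procs: int, n_mems: int, p: float) -> float:
    """Expected number of busy memory modules per cycle for an
    n_procs x n_mems crossbar, where each processor issues a request
    with probability p to a uniformly chosen memory module."""
    # Probability that a given memory module receives no request at all
    idle = (1.0 - p / n_mems) ** n_procs
    return n_mems * (1.0 - idle)

# e.g. 16 processors, 16 memories, every processor requesting each cycle:
bw = crossbar_bandwidth(16, 16, 1.0)   # about 10.3 busy memories per cycle
```

The same expression with request conflicts capped by the number of buses is the usual starting point for multiple-bus comparisons.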


Architectures for Networking and Communications Systems | 2007

Compiling PCRE to FPGA for accelerating SNORT IDS

Abhishek Mitra; Walid A. Najjar; Laxmi N. Bhuyan

Deep payload inspection systems like SNORT and BRO use regular expressions in their rules for their high expressiveness and compactness. The SNORT IDS uses the PCRE engine for regular-expression matching on the payload. The software-based PCRE engine uses an NFA built from opcodes that are determined by the regular-expression operators in a rule. Each rule in the SNORT ruleset is translated by the PCRE compiler into a unique regular-expression engine. Since the software-based PCRE engine can match the payload against only one regular expression at a time, and must do so for every rule in the ruleset, the throughput of the SNORT IDS dwindles as each packet is processed through a multitude of regular expressions. In this paper we detail our implementation of hardware-based regular-expression engines for the SNORT IDS by transforming the PCRE opcodes generated by the PCRE compiler from SNORT regular-expression rules. Our compiler generates VHDL code corresponding to the opcodes generated for the SNORT regular-expression rules. We have tuned our hardware implementation to use an NFA-based regular-expression engine with greedy quantifiers, in much the same way as the software-based PCRE engine. Our system implements a regular expression only once for each new rule in the SNORT ruleset, resulting in a fast system that scales well with new updates. We implement two hundred PCRE engines based on a wide range of SNORT IDS rules, using a Virtex-4 LX200 FPGA on the SGI RASC RC 100 Blade connected to the SGI ALTIX 4700 supercomputing system as a testbed. We obtain an interface throughput of 12.9 Gbits/s and a maximum speedup of 353x over software-based PCRE execution.
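The throughput problem described above, one software engine scanning the same payload per rule, can be sketched as follows (the rule strings below are made-up stand-ins, not actual SNORT rules):

```python
import re

# Hypothetical rule patterns standing in for SNORT PCRE rules (illustrative only)
rules = [r"cmd\.exe", r"/etc/passwd", r"\x90{8,}", r"union\s+select"]
engines = [re.compile(p, re.IGNORECASE) for p in rules]

def match_sequential(payload: bytes) -> list:
    """Software-style matching: run each engine over the payload in turn.
    Cost grows linearly with the number of rules; the FPGA approach instead
    instantiates one hardware engine per rule and runs them in parallel."""
    text = payload.decode("latin-1")
    return [e.pattern for e in engines if e.search(text)]
```

For example, `match_sequential(b"GET /etc/passwd HTTP/1.1")` reports the one matching pattern only after every engine has scanned the packet.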


International Conference on Parallel Processing | 1993

An Adaptive Submesh Allocation Strategy for Two-Dimensional Mesh Connected Systems

Jianxun Ding; Laxmi N. Bhuyan

In this paper, we propose an adaptive scan (AS) strategy for submesh allocation. The earlier frame sliding (FS) strategy allocates submeshes based on fixed orientations of incoming tasks. It also slides frames on mesh planes by fixed strides. Our AS allocation strategy differs from the FS strategy in the following two ways: (1) it does not fix the orientations of incoming tasks; (2) it scans mesh planes adaptively. Experimental studies show that our AS strategy outperforms the FS strategy in terms of external fragmentation, completion time, and processor utilization.
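A minimal sketch of the orientation-free scan described above, checking both task orientations at each frame position (our own simplification with a fixed stride of 1; the actual AS strategy also adapts the scan stride):

```python
def find_submesh(busy, w, h):
    """Scan a 2D mesh (busy[r][c] is True if node (r, c) is allocated)
    for a free w x h submesh, trying both orientations of the task."""
    R, C = len(busy), len(busy[0])
    orientations = ((h, w), (w, h)) if w != h else ((h, w),)
    for a, b in orientations:
        for r in range(R - a + 1):
            for c in range(C - b + 1):
                if all(not busy[i][j]
                       for i in range(r, r + a) for j in range(c, c + b)):
                    return (r, c, a, b)   # base corner plus chosen orientation
    return None                            # external fragmentation: no fit
```

On a 4x4 mesh whose leftmost column is busy, a 3x2 task still fits once the scan rotates it against the free 4x3 region.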


IEEE Transactions on Computers | 1989

Approximate analysis of single and multiple ring networks

Laxmi N. Bhuyan; Dipak Ghosal; Qing Yang

Asynchronous packet-switched interconnection networks with decentralized control are very appropriate for multiprocessing and data-flow architectures. The authors present performance models of single- and multiple-ring networks based on token-ring, slotted-ring, and register-insertion-ring protocols. The multiple-ring networks have the advantage of being reliable, expandable, and cost effective. An approximate and uniform analysis, based on the gated M/G/1 queuing model, has been developed to evaluate the performance of both existing single-ring networks and the proposed multiple-ring networks. The approximations are good at low and medium loads. The analyses are based on a symmetric ring structure with a nonexhaustive service policy and infinite queue length at each station; they essentially involve modeling queues with single and multiple walking servers. The results obtained from the analytical models are compared with those obtained from simulation.
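The single-server building block behind such an analysis is the M/G/1 queue, whose mean queueing delay is given by the Pollaczek-Khinchine formula. A sketch of just that basic form (the paper's gated, walking-server ring models build on it):

```python
def mg1_mean_wait(lam: float, es: float, es2: float) -> float:
    """Mean time a packet waits in queue in an M/G/1 system:
    Poisson arrival rate lam, mean service time es, second moment es2."""
    rho = lam * es                       # server utilization, must be < 1
    if rho >= 1.0:
        raise ValueError("unstable queue (rho >= 1)")
    return lam * es2 / (2.0 * (1.0 - rho))

# Exponential service (es2 = 2 * es**2) at half load:
w = mg1_mean_wait(0.5, 1.0, 2.0)         # -> 1.0
```

Deterministic service at the same load (es2 = es**2) halves the wait, which is why service-time variance matters for ring protocols.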


IEEE Transactions on Computers | 1985

Bandwidth availability of multiple-bus multiprocessors

Chita R. Das; Laxmi N. Bhuyan

The effect of failures on the performance of multiple-bus multiprocessors is considered. Bandwidth expressions for this architecture are derived for uniform and nonuniform memory references. Mathematical models are developed to compute the reliability and the performance-related bandwidth availability (BA). The results obtained for the multiple-bus interconnection are compared with those of a crossbar. The models are also extended to analyze the partial bus structure, where the memories are divided into groups and each group is connected to a subset of buses. The reliability and the BA of the multiple-bus and partial bus architectures are compared.
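Under the uniform-reference assumption, the bandwidth of a B-bus system can be estimated with a short Monte Carlo simulation; this is a sketch of the model, not the paper's analytical derivation:

```python
import random

def simulate_bandwidth(n, m, b, p, cycles=20000, seed=1):
    """Monte Carlo estimate of bandwidth (busy memories per cycle) for an
    n-processor, m-memory, b-bus system with uniform memory references."""
    rng = random.Random(seed)
    total = 0
    for _ in range(cycles):
        # Set of distinct memories requested this cycle
        requested = {rng.randrange(m) for _ in range(n) if rng.random() < p}
        total += min(len(requested), b)   # at most b buses can be granted
    return total / cycles
```

Reducing `b` in this model is exactly how bus failures degrade bandwidth availability: with `b` equal to `m` the estimate approaches the crossbar value, and with a single surviving bus it saturates at one transfer per cycle.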


IEEE Transactions on Computers | 2005

EaseCAM: an energy and storage efficient TCAM-based router architecture for IP lookup

V. C. Ravikumar; Rabi N. Mahapatra; Laxmi N. Bhuyan

Ternary content addressable memories (TCAMs) have been emerging as a popular device in designing routers for packet forwarding and classification. Despite their promise of high throughput, large TCAM arrays are prohibitive due to their excessive power consumption and lack of scalable design schemes. We present a TCAM-based router architecture that is energy and storage efficient. We introduce prefix aggregation and expansion techniques to compact the effective TCAM size in a router. Pipelining and paging schemes are employed in the architecture to activate a limited number of entries in the TCAM array during an IP lookup. The new architecture provides low power, fast incremental updating, and fast table lookup. Heuristic algorithms for page filling, fast prefix update, and memory management are also provided. Results are illustrated with two large routers (bbnplanet and attcanada) to demonstrate the effectiveness of our approach.
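The prefix-aggregation idea, merging adjacent prefixes that share a next hop so they occupy one TCAM entry instead of two, is easy to see with the standard library (the addresses below are illustrative, not from the paper's routing tables):

```python
import ipaddress

# Two adjacent /25 prefixes with the same next hop...
prefixes = [ipaddress.ip_network("10.0.0.0/25"),
            ipaddress.ip_network("10.0.0.128/25")]

# ...collapse into a single /24, halving the TCAM entries required.
merged = list(ipaddress.collapse_addresses(prefixes))
```

`collapse_addresses` only merges prefixes that are truly contiguous, which mirrors the constraint that aggregation must not change lookup results.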


Architectures for Networking and Communications Systems | 2008

Software techniques to improve virtualized I/O performance on multi-core systems

Guangdeng Liao; Danhua Guo; Laxmi N. Bhuyan; Steve R. King

Virtualization technology is now widely deployed on high performance networks such as 10-Gigabit Ethernet (10GE). It offers useful features like functional isolation, manageability and live migration. Unfortunately, the overhead of network I/O virtualization significantly degrades the performance of network-intensive applications. Two major factors of loss in I/O performance result from the extra driver domain to process I/O requests and the extra scheduler inside the virtual machine monitor (VMM) for scheduling domains. In this paper we first examine the negative effect of virtualization in multi-core platforms with 10GE networking. We study virtualization overhead and develop two optimizations for the VMM scheduler to improve I/O performance. The first solution uses cache-aware scheduling to reduce inter-domain communication cost. The second solution steals scheduler credits to favor I/O VCPUs in the driver domain. We also propose two optimizations to improve packet processing in the driver domain. First we re-design a simple bridge for more efficient switching of packets. Second we develop a patch to make transmit (TX) queue length in the driver domain configurable and adaptable to 10GE networks. Using all the above techniques, our experiments show that virtualized I/O bandwidth can be increased by 96%. Our optimizations also improve the efficiency by saving 36% in core utilization per gigabit. All the optimizations are based on pure software approaches and do not hinder live migration. We believe that the findings from our study will be useful to guide future VMM development.


High Performance Distributed Computing | 2014

CuSha: vertex-centric graph processing on GPUs

Farzad Khorasani; Keval Vora; Rajiv Gupta; Laxmi N. Bhuyan

Vertex-centric graph processing is employed by many popular algorithms (e.g., PageRank) due to its simplicity and efficient use of asynchronous parallelism. The high compute power provided by SIMT architectures presents an opportunity for accelerating these algorithms using GPUs. Prior work on GPU graph processing employs the Compressed Sparse Row (CSR) format for its space efficiency; however, CSR suffers from irregular memory accesses and GPU underutilization that limit its performance. In this paper, we present CuSha, a CUDA-based graph processing framework that overcomes these obstacles via two novel graph representations: G-Shards and Concatenated Windows (CW). G-Shards uses a concept recently introduced for non-GPU systems that organizes a graph into autonomous sets of ordered edges called shards. CuSha's mapping of GPU hardware resources onto shards allows fully coalesced memory accesses. CW is a novel representation that enhances the use of shards to achieve higher GPU utilization for processing sparse graphs. Finally, CuSha fully utilizes the GPU by processing multiple shards in parallel on its streaming multiprocessors. For ease of programming, CuSha allows the user to define the vertex-centric computation and plug it into the framework for parallel processing of large graphs. Our experiments show that CuSha provides significant speedups over the state-of-the-art CSR-based virtual warp-centric method for processing graphs on GPUs.
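The shard layout can be sketched on the host side: edges are binned by destination-vertex window and ordered by source within each bin, so that reads of source-vertex values become sequential (coalesced) on the GPU. This is a simplification of G-Shards that ignores the per-edge value arrays and the CW representation:

```python
def build_shards(edges, num_vertices, window):
    """Partition an edge list into shards: shard k holds every edge whose
    destination lies in vertex window k, sorted by source vertex."""
    num_shards = (num_vertices + window - 1) // window
    shards = [[] for _ in range(num_shards)]
    for src, dst in edges:
        shards[dst // window].append((src, dst))
    for s in shards:
        s.sort()                  # order by source within each shard
    return shards
```

Each shard can then be processed by one thread block independently of the others, which is what lets multiple shards run in parallel across streaming multiprocessors.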


IEEE International Symposium on Workload Characterization | 2011

Thread reinforcer: Dynamically determining number of threads via OS level monitoring

Kishore Kumar Pusukuri; Rajiv Gupta; Laxmi N. Bhuyan

It is often assumed that to maximize the performance of a multithreaded application, the number of threads created should equal the number of cores. While this may be true for systems with four or eight cores, it is not true for systems with a larger number of cores, as our experiments with PARSEC programs on a 24-core machine demonstrate. Dynamically determining the appropriate number of threads for a multithreaded application is therefore an important unsolved problem. In this paper we develop a simple technique for dynamically determining the appropriate number of threads without recompiling the application, using complex compilation techniques, or modifying operating system policies. We first present a scalability study of eight PARSEC programs conducted on a 24-core Dell PowerEdge R905 server running OpenSolaris 2009.06, with thread counts ranging from a few threads to 128. Our study shows that not only does the maximum speedup achieved by these programs vary widely (from 3.6x to 21.9x), the number of threads that produces the maximum speedup also varies widely (from 16 to 63 threads). By understanding the overall speedup behavior of these programs we identify the critical operating-system-level factors that explain why the speedups vary with the number of threads. Applying these observations, we develop a framework called “Thread Reinforcer” that dynamically monitors a program's execution to search for the number of threads likely to yield the best speedup. Thread Reinforcer identifies an optimal or near-optimal number of threads for most of the PARSEC programs studied, as well as for the SPEC OMP and PBZIP2 programs.
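The search itself can be as simple as a greedy climb over thread counts that stops when the measured rate stops improving. This is a toy sketch of the idea only; the real framework bases its decisions on OS-level metrics (e.g., lock contention and thread migrations) rather than raw throughput alone:

```python
def pick_thread_count(measure, max_threads, step=2):
    """Greedy search: keep increasing the thread count while the measured
    rate improves; stop at the first drop. `measure(t)` returns the
    application's observed throughput when run with t threads."""
    best_t, best_rate = 1, measure(1)
    t = 1 + step
    while t <= max_threads:
        rate = measure(t)
        if rate <= best_rate:
            break                 # the speedup curve has turned over
        best_t, best_rate = t, rate
        t += step
    return best_t
```

A greedy climb like this assumes the speedup curve is roughly unimodal, which the paper's scalability study suggests holds for most of the PARSEC programs examined.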


International Symposium on Performance Analysis of Systems and Software | 2005

Anatomy and Performance of SSL Processing

Li Zhao; Ravi R. Iyer; Srihari Makineni; Laxmi N. Bhuyan

A wide spectrum of e-commerce (B2B/B2C), banking, financial trading, and other business applications require the exchange of data to be highly secure. The Secure Sockets Layer (SSL) protocol provides the essential ingredients of secure communication: privacy, integrity, and authentication. Though it is well understood that security always comes at the cost of performance, these costs depend on the cryptographic algorithms. In this paper, we present a detailed description of the anatomy of a secure session. We analyze the time spent on the various cryptographic operations (symmetric, asymmetric, and hashing) during session negotiation and data transfer. We then analyze the most frequently used cryptographic algorithms (RSA, AES, DES, 3DES, RC4, MD5, and SHA-1). We determine the key components of these algorithms (setting up key schedules, encryption rounds, substitutions, permutations, etc.) and where most of the time is spent. We also provide an architectural analysis of these algorithms, show the frequently executed instructions, and discuss the ISA/hardware support that may be beneficial to improving SSL performance. We believe that the performance data presented in this paper will be useful to performance analysts and processor architects seeking to accelerate SSL performance in future processors.
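A flavor of this measurement methodology, timing one primitive in isolation, can be reproduced with the standard library (the numbers will of course differ from the paper's 2005-era hardware, and this covers only the hashing component):

```python
import hashlib
import time

def throughput(hash_name, payload, iters=200):
    """Rough throughput in MB/s for one hash algorithm over a fixed payload."""
    h = getattr(hashlib, hash_name)
    start = time.perf_counter()
    for _ in range(iters):
        h(payload).digest()
    elapsed = time.perf_counter() - start
    return len(payload) * iters / elapsed / 1e6

payload = b"\x00" * (64 * 1024)
for name in ("md5", "sha1"):
    print(f"{name}: {throughput(name, payload):.0f} MB/s")
```

Per-byte measurements like this expose where time goes inside the data-transfer phase; the session-negotiation phase is instead dominated by the asymmetric (RSA) operations.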

Collaboration


Dive into Laxmi N. Bhuyan's collaborations.

Top Co-Authors

Rajiv Gupta, University of California
Jingnan Yao, University of California
Yan Luo, University of Massachusetts Lowell
Yeim Kuan Chang, National Cheng Kung University
Guangdeng Liao, University of California
Chita R. Das, Pennsylvania State University
Danhua Guo, University of California