Publication


Featured research published by Walid A. Najjar.


international symposium on computer architecture | 2003

A highly configurable cache architecture for embedded systems

Chuanjun Zhang; Frank Vahid; Walid A. Najjar

Energy consumption is a major concern in many embedded computing systems. Several studies have shown that cache memories account for about 50% of the total energy consumed in these systems. The performance of a given cache architecture is largely determined by the behavior of the application using that cache. Desktop systems have to accommodate a very wide range of applications, and therefore the manufacturer usually sets the cache architecture as a compromise given current applications, technology and cost. Unlike desktop systems, embedded systems are designed to run a small range of well-defined applications. In this context, a cache architecture that is tuned for that narrow range of applications can have both increased performance and lower energy consumption. We introduce a novel cache architecture intended for embedded microprocessor platforms. The cache can be configured by software to be direct-mapped, two-way, or four-way set associative, using a technique we call way concatenation, having very little size or performance overhead. We show that the proposed cache architecture reduces energy caused by dynamic power compared to a way-shutdown cache. Furthermore, we extend the cache architecture to also support a way shutdown method designed to reduce the energy from static power that is increasing in importance in newer CMOS technologies. Our study of 23 programs drawn from Powerstone, MediaBench and Spec2000 shows that tuning the cache's configuration saves energy for every program compared to conventional four-way set-associative as well as direct-mapped caches, with average savings of 40% compared to a four-way conventional cache.
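As a rough illustration of what changing associativity at a fixed capacity means, the sketch below splits a byte address into tag, set index, and offset for the three configurations the abstract names. It is a generic textbook model, not the paper's way-concatenation circuit; the function name `split_address`, the 8 KB capacity, and the 32-byte line size are assumptions chosen for the example.

```python
def split_address(addr, cache_bytes=8192, line_bytes=32, ways=4):
    """Return (tag, set_index, offset) for a byte address.

    cache_bytes and line_bytes are illustrative defaults, not values
    taken from the paper. Total capacity stays fixed, so fewer ways
    means more sets and therefore a longer index and a shorter tag.
    """
    if ways not in (1, 2, 4):
        raise ValueError("ways must be 1 (direct-mapped), 2, or 4")
    num_sets = cache_bytes // (line_bytes * ways)
    offset = addr % line_bytes          # byte within the cache line
    line_number = addr // line_bytes    # which memory line the address is in
    set_index = line_number % num_sets  # set the line maps to
    tag = line_number // num_sets       # remaining high-order bits
    return tag, set_index, offset


if __name__ == "__main__":
    for ways in (1, 2, 4):
        print(ways, split_address(0x12345, ways=ways))
```

With these assumed sizes, the same address keeps its offset while the tag grows as the number of ways increases, which is why a reconfigurable design has to account for different tag widths.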


field-programmable custom computing machines | 2010

Designing Modular Hardware Accelerators in C with ROCCC 2.0

Jason R. Villarreal; Adrian Park; Walid A. Najjar; Robert J. Halstead

While FPGA-based hardware accelerators have repeatedly been demonstrated as a viable option, their programmability remains a major barrier to their wider acceptance by application code developers. These platforms are typically programmed in a low level hardware description language, a skill not common among application developers and a process that is often tedious and error-prone. Programming FPGAs from high level languages would provide easier integration with software systems as well as open up hardware accelerators to a wider spectrum of application developers. In this paper, we present a major revision to the Riverside Optimizing Compiler for Configurable Circuits (ROCCC) designed to create hardware accelerators from C programs. Novel additions to ROCCC include (1) intuitive modular bottom-up design of circuits from C, and (2) separation of code generation from specific FPGA platforms. The additions we make do not introduce any new syntax to the C code and maintain the high level optimizations from the ROCCC system that generate efficient code. The modular code we support functions identically as software or hardware. Additionally, we enable user control of hardware optimizations such as systolic array generation and temporal common subexpression elimination. We evaluate the quality of the ROCCC 2.0 tool by comparing it to hand-written VHDL code. We show comparable clock frequencies and an 18% higher throughput. The productivity advantages of ROCCC 2.0 are evaluated using the metrics of lines of code and programming time, showing an average of 15x improvement over hand-written VHDL.


IEEE Transactions on Computers | 1990

Network resilience: a measure of network fault tolerance

Walid A. Najjar; Jean-Luc Gaudiot

A probabilistic measure of network fault tolerance expressed as the probability of a disconnection is proposed. Qualitative evaluation of this measure is presented. As expected, the single-node disconnection probability is the dominant factor irrespective of the topology under consideration. The authors derive an analytical approximation to the disconnection probability and verify it with a Monte Carlo simulation. On the basis of this model, the measures of network resilience and relative network resilience are proposed as probabilistic measures of network fault tolerance. These are used to evaluate the effects of the disconnection probability on the reliability of the system.
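To make the idea of estimating a disconnection probability by Monte Carlo simulation concrete, here is a minimal sketch. It assumes independent node failures with a common probability `p_fail` and treats the network as disconnected when the surviving nodes no longer form one connected component; both the failure model and the function names are assumptions for illustration and may differ from the model used in the paper.

```python
import random


def is_disconnected(adj, failed):
    """True if the surviving nodes do not form a single connected component."""
    alive = [v for v in adj if v not in failed]
    if len(alive) < 2:
        return False  # assumption: 0 or 1 survivors count as connected
    seen = {alive[0]}
    stack = [alive[0]]
    while stack:  # depth-first search over surviving nodes only
        v = stack.pop()
        for w in adj[v]:
            if w not in failed and w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) < len(alive)


def estimate_disconnection_probability(adj, p_fail, trials=10000, seed=0):
    """Monte Carlo estimate under independent node failures (assumed model)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        failed = {v for v in adj if rng.random() < p_fail}
        if is_disconnected(adj, failed):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Four-node ring used purely as a toy topology.
    ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
    print(estimate_disconnection_probability(ring, p_fail=0.1))
```

An estimate of this kind is what an analytical approximation would be checked against; the number of trials controls the sampling error.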


architectures for networking and communications systems | 2007

Compiling PCRE to FPGA for accelerating SNORT IDS

Abhishek Mitra; Walid A. Najjar; Laxmi N. Bhuyan

Deep Payload Inspection systems like SNORT and BRO utilize regular expressions for their rules due to their high expressibility and compactness. The SNORT IDS system uses the PCRE Engine for regular expression matching on the payload. The software based PCRE Engine utilizes an NFA engine based on certain opcodes which are determined by the regular expression operators in a rule. Each rule in the SNORT ruleset is translated by the PCRE compiler into a unique regular expression engine. Since the software based PCRE engine can match the payload with only a single regular expression at a time, and needs to do so for multiple rules in the ruleset, the throughput of the SNORT IDS system dwindles as each packet is processed through a multitude of regular expressions. In this paper we detail our implementation of hardware based regular expression engines for the SNORT IDS by transforming the PCRE opcodes generated by the PCRE compiler from SNORT regular expression rules. Our compiler generates VHDL code corresponding to the opcodes generated for the SNORT regular expression rules. We have tuned our hardware implementation to utilize an NFA based regular expression engine, using greedy quantifiers, in much the same way as the software based PCRE engine. Our system implements a regular expression only once for each new rule in the SNORT ruleset, thus resulting in a fast system that scales well with new updates. We implement two hundred PCRE engines based on a plethora of SNORT IDS rules, and use a Virtex-4 LX200 FPGA, on the SGI RASC RC 100 Blade connected to the SGI ALTIX 4700 supercomputing system as a testbed. We obtain an interface throughput of 12.9 Gbits/s and a maximum speedup of 353X over software based PCRE execution.
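The software bottleneck described above, one regular expression applied at a time across a whole ruleset, can be sketched with Python's standard `re` module standing in for PCRE. The three rules are hypothetical patterns written for this example, not entries from the SNORT ruleset, and the code says nothing about how the hardware engines are built.

```python
import re

# Hypothetical rules for illustration only; not taken from SNORT.
RULES = {
    "rule_a": rb"GET\s+/admin",
    "rule_b": rb"cmd=[a-z]+\.exe",
    "rule_c": rb"\x90{4,}",
}

# Each rule is compiled once, then reused for every payload.
COMPILED = {name: re.compile(pattern) for name, pattern in RULES.items()}


def matching_rules(payload):
    """Scan the payload once per rule, sequentially, and list the rules that hit."""
    hits = []
    for name, rx in COMPILED.items():
        if rx.search(payload):  # one full scan of the payload per rule
            hits.append(name)
    return hits


if __name__ == "__main__":
    print(matching_rules(b"GET /admin HTTP/1.1"))
```

Because the loop runs once per rule, the work per packet grows with the size of the ruleset, whereas independent hardware engines can examine the same payload in parallel.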


field programmable gate arrays | 2004

A quantitative analysis of the speedup factors of FPGAs over processors

Zhi Guo; Walid A. Najjar; Frank Vahid; Kees A. Vissers

The speedup over a microprocessor that can be achieved by implementing some programs on an FPGA has been extensively reported. This paper presents an analysis, both quantitative and qualitative, at the architecture level of the components of this speedup. Obviously, the spatial parallelism that can be exploited on the FPGA is a big component. By itself, however, it does not account for the whole speedup. In this paper we experimentally analyze the remaining components of the speedup. We compare the performance of image processing application programs executing in hardware on a Xilinx Virtex E2000 FPGA to that on three general-purpose processor platforms: MIPS, Pentium III and VLIW. The question we set out to answer is what is the inherent advantage of a hardware implementation over a von Neumann platform. On the one hand, the clock frequency of general-purpose processors is about 20 times that of typical FPGA implementations. On the other hand, the iteration level parallelism on the FPGA is one to two orders of magnitude greater than that on the CPUs. In addition to these two factors, we identify the efficiency advantage of FPGAs as an important factor and show that it ranges from 6 to 47 on our test benchmarks. We also identify some of the components of this factor: the streaming of data from memory, the overlap of control and data flow and the elimination of some instructions on the FPGA. The results provide a deeper understanding of the tradeoff between system complexity and performance when designing Configurable SoC as well as designing software for CSoC. They also help understand the one to two orders of magnitude in speedup of FPGAs over CPU after accounting for clock frequencies.


IEEE Computer | 2003

High-level language abstraction for reconfigurable computing

Walid A. Najjar; Willem A. P. Bohm; Bruce A. Draper; Jeffrey Hammes; Robert G. Rinker; J.R. Beveridge; Monica Chawathe; Charlie Ross

RC systems typically consist of an array of configurable computing elements. The computational granularity of these elements ranges from simple gates, as abstracted by FPGA lookup tables, to complete arithmetic-logic units with or without registers. A rich programmable interconnect completes the array. An RC system developer manually partitions an application into two segments: a hardware component in a hardware description language such as VHDL or Verilog that will execute as a circuit on the FPGA and a software component that will execute as a program on the host. Single-assignment C is a C language variant designed to create an automated compilation path from an algorithmic programming language to an FPGA-based reconfigurable computing system.


international conference on parallel architectures and compilation techniques | 1999

Cameron: high level language compilation for reconfigurable systems

Jeff Hammes; Bob Rinker; Wim Bohm; Walid A. Najjar; Bruce A. Draper; Ross Beveridge

This paper presents the Cameron Project, which aims to provide a high level, algorithmic language and optimizing compiler for the development of image processing applications on reconfigurable computing systems (RCSs). SA-C, a single assignment variant of the C programming language, is designed to exploit both coarse-grain and fine-grain parallelism in image processing applications. Khoros, a software development environment commonly used for image processing, has been modified to support SA-C program development. SA-C supports image processing with true multidimensional arrays, and with sophisticated array access and windowing mechanisms. Reduction operators such as medians and histograms are also provided. The optimizing compiler targets RCSs, which are fine-grained parallel processors made up of field programmable gate arrays (FPGAs), memories and interconnection hardware. They can be used as inexpensive co-processors with conventional workstations or PCs. This paper discusses compiler optimizations to generate optimal FPGA code using dataflow analysis techniques applied to data dependence graphs. Initial results are presented.


The Journal of Supercomputing | 2002

Mapping a Single Assignment Programming Language to Reconfigurable Systems

A. P. Wim Böhm; Jeffrey Hammes; Bruce A. Draper; Monica Chawathe; Charlie Ross; Robert Rinker; Walid A. Najjar

This paper presents the high level, machine independent, algorithmic, single-assignment programming language SA-C and its optimizing compiler targeting reconfigurable systems. SA-C is intended for Image Processing applications. Language features are introduced and discussed. The intermediate forms DDCF, DFG and AHA, used in the optimization and code-generation phases, are described. Conventional and reconfigurable system specific optimizations are introduced. The code generation process is described. The performance for these systems is analyzed, using a range of applications from simple Image Processing Library functions to more comprehensive applications, such as the ARAGTAP target acquisition prescreener.


design, automation, and test in europe | 2005

Optimized Generation of Data-Path from C Codes for FPGAs

Zhi Guo; Betul Buyukkurt; Walid A. Najjar; Kees A. Vissers

FPGAs, as computing devices, offer significant speedup over microprocessors. Furthermore, their configurability offers an advantage over traditional ASICs. However, they do not yet enjoy high-level language programmability, as microprocessors do. This has become the main obstacle for their wider acceptance by application designers. ROCCC is a compiler designed to generate circuits from C source code to execute on FPGAs, more specifically on CSoCs. It generates RTL-level HDL from frequently executing kernels in an application. In this paper, we give an overview of the ROCCC system and focus on its data path generation. We compare the performance of ROCCC-generated VHDL code with that of Xilinx IPs. The synthesis result shows that the ROCCC-generated circuit takes around 2× to 3× the area and runs at a comparable clock rate.


international conference on data mining | 2010

Accelerating Dynamic Time Warping Subsequence Search with GPUs and FPGAs

Doruk Sart; Abdullah Mueen; Walid A. Najjar; Eamonn J. Keogh; Vit Niennattrakul

Many time series data mining problems require subsequence similarity search as a subroutine. Dozens of similarity/distance measures have been proposed in the last decade and there is increasing evidence that Dynamic Time Warping (DTW) is the best measure across a wide range of domains. Given DTW’s usefulness and ubiquity, there has been a large community-wide effort to mitigate its relative lethargy. Proposed speedup techniques include early abandoning strategies, lower-bound based pruning, indexing and embedding. In this work we argue that we are now close to exhausting all possible speedup from software, and that we must turn to hardware-based solutions. With this motivation, we investigate both GPU (Graphics Processing Unit) and FPGA (Field Programmable Gate Array) based acceleration of subsequence similarity search under the DTW measure. As we shall show, our novel algorithms allow GPUs to achieve two orders of magnitude speedup and FPGAs to produce four orders of magnitude speedup. We conduct detailed case studies on the classification of astronomical observations and demonstrate that our ideas allow us to tackle problems that would be untenable otherwise.
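For readers unfamiliar with the DTW measure, the following minimal sketch shows the classic dynamic-programming recurrence and a brute-force subsequence search built on it. It is the textbook unconstrained formulation with squared point-wise cost, written for illustration; it is not the GPU or FPGA algorithm of the paper, and the function names are assumptions.

```python
import math


def dtw_distance(a, b):
    """Unconstrained DTW with squared point-wise cost, using two rows of memory."""
    n, m = len(a), len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix costs nothing
    for i in range(1, n + 1):
        curr = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            # Extend the cheapest of the three allowed warping steps.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]


def best_subsequence(query, series):
    """Slide a window of len(query) over series; return (distance, start index)."""
    w = len(query)
    best = (math.inf, -1)
    for start in range(len(series) - w + 1):
        d = dtw_distance(query, series[start:start + w])
        if d < best[0]:
            best = (d, start)
    return best


if __name__ == "__main__":
    print(best_subsequence([1, 2, 3], [0, 0, 1, 2, 3, 0]))
```

Each window costs time proportional to the square of the query length, and the windows are independent of one another, which is what makes the search a natural candidate for parallel hardware.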

Collaboration


Dive into Walid A. Najjar's collaboration.

Top Co-Authors

A. P. Wim Böhm

Colorado State University

Zhi Guo

University of California

Frank Vahid

University of California

Bruce A. Draper

Colorado State University

Jeffrey Hammes

Colorado State University

Lucas Roh

Colorado State University
